Fork of Liu Feng's CoverHunter project. Goals: Make it run, and run fast, on any platform. Document it better. And build it out as a useful toolset for music research generally.
See https://ar5iv.labs.arxiv.org/html/2306.09025 for the July 2023 research paper that accompanied the original CoverHunter repo.
- Either:
  - Apple computer with an Apple M-series chip
  - Other computer with an Nvidia GPU (including free cloud options like Google Colab)
- python3 (minimum version 3.10, tested on 3.11) with these libraries:
  - torch torchaudio (compiled with CUDA or MPS or other GPU support as appropriate for your hardware)
  - librosa
  - nnAudio
  - Optional: tensorboard
- sox
Either:
- Clone this repo or download it to a folder on your computer. Run the following OS command-line commands from that folder. These commands assume you have a Unix/Linux/MacOS environment, but Windows equivalents exist - see issue #10.
- Or run this project in Google Colab, using this Colab notebook: https://colab.research.google.com/drive/1HKVT3_0ioRPy7lrKzikGXXGysxZrHpOr?usp=sharing
Follow the example of the prepared Covers80 dataset included with the original CoverHunter. Directions here are for using that prepared data. See also the "dataset.txt format" heading below.
- Download and extract the contents of the covers80.tgz file from http://labrosa.ee.columbia.edu/projects/coversongs/covers80/
- Abandon the 2-level folder structure that came inside the covers80.tgz file, flattening so all the .mp3 files are in the same folder. One way to do this is:
  - In Terminal, go to the extracted coversongs folder as the current directory. Then: cd covers32k && mv */* .; rm -r */
- Convert all the provided .mp3 files to .wav format. One way to do this is:
  setopt EXTENDED_GLOB; for f in *.mp3; do sox "$f" -r 16000 "${f%%.mp3}.wav" && rm "$f"; done
- Move all these new .wav files to a new folder called wav_16k in the project's data/covers80 folder.
- You can delete the rest of the downloaded covers80.tgz contents.
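For readers without a zsh-style shell (for example on Windows), the flatten-and-convert steps above can be approximated in Python. This is a minimal sketch and not part of the project: the function names flatten_mp3s, sox_convert_command and convert_all are hypothetical, and it assumes the sox executable is on your PATH and was built with MP3 support.

```python
import shutil
import subprocess
from pathlib import Path


def flatten_mp3s(root: Path) -> list[Path]:
    """Move every .mp3 found one level down from root up into root itself."""
    moved = []
    for mp3 in sorted(root.glob("*/*.mp3")):
        target = root / mp3.name
        shutil.move(str(mp3), str(target))
        moved.append(target)
    # Remove only the subfolders that are now empty (more cautious than `rm -r */`).
    for sub in list(root.iterdir()):
        if sub.is_dir() and not any(sub.iterdir()):
            sub.rmdir()
    return moved


def sox_convert_command(mp3: Path, out_dir: Path, rate: int = 16000) -> list[str]:
    """Build the sox command that resamples one .mp3 into a .wav at `rate` Hz."""
    wav = out_dir / (mp3.stem + ".wav")
    return ["sox", str(mp3), "-r", str(rate), str(wav)]


def convert_all(root: Path, out_dir: Path) -> None:
    """Convert every .mp3 in root to a 16 kHz .wav in out_dir, then delete the .mp3."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for mp3 in sorted(root.glob("*.mp3")):
        subprocess.run(sox_convert_command(mp3, out_dir), check=True)
        mp3.unlink()  # mirrors the `rm "$f"` in the shell one-liner
```

Typical use would be flatten_mp3s(Path("coversongs/covers32k")) followed by convert_all(Path("coversongs/covers32k"), Path("data/covers80/wav_16k")), which also covers the step of moving the .wav files into the project.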
You must run this before proceeding to the Train step. And you can't run this without first doing the Data Preparation step above.
From the project root folder, run:
python3 -m tools.extract_csi_features data/covers80/
See "Input and Output Files" below for more information about what happens here.
CoverHunter includes a prepared configuration to run a training session on the Covers80 dataset located in the 'egs/covers80' subfolder of the project. Important note: the default configuration that the CoverHunter authors provided was a nonsense or toy configuration that only demonstrated that you have a working project and environment. It used the same dataset for both training and validation, so by definition it rapidly converged and overfit.
This fork added a train/validate/test data-splitting function in the extract_csi_features tool, along with corresponding new training hyperparameters. Note that CoverHunter used the terms "train / train-sample / dev" for train / validate / test.
Specify the path where the training hyperparameters are available (in this case using the provided example for covers80) and where the model output will go, as the one required command-line parameter:
python -m tools.train egs/covers80/
This fork also added an optional --runid parameter so you can distinguish your training runs in TensorBoard in case you are experimenting:
python -m tools.train egs/covers80/ --runid 'first try'
To see the TensorBoard live visualization of the model's progress during training, run this in a separate terminal window, from the root of the project folder, and then use the URL listed in the output to watch the TensorBoard:
tensorboard --logdir=egs/covers80/logs
Optionally edit the hparams.yaml configuration file in the folder egs/covers80/config before starting a training run. If you run into memory limits, start by decreasing the batch size from 64 to 32.
This fork added the hyperparameter early_stopping_patience to support the added feature of early stopping (original CoverHunter defaulted to 10,000 epochs!).
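The idea behind early_stopping_patience can be sketched as follows. This is a generic patience counter, assuming validation loss is the monitored value. The names EarlyStopper and epochs_run are hypothetical, and the fork's actual logic in tools.train may differ in detail.

```python
class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int) -> None:
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, validation_loss: float) -> bool:
        if validation_loss < self.best_loss:
            # New best: remember it and reset the counter.
            self.best_loss = validation_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def epochs_run(losses: list[float], patience: int) -> int:
    """Return how many epochs run before the stopper fires (or all of them)."""
    stopper = EarlyStopper(patience)
    for epoch, loss in enumerate(losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(losses)
```

With a patience of 3, a run whose best loss occurs in epoch 2 would end after epoch 5, however many epochs were configured.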
Note: Don't use the torchrun launch command offered in original CoverHunter. In the single-computer Apple Silicon context, it is not only irrelevant, it actually slows down performance. In my tests it slowed down tools.train performance by about 20%.
The training script's output consists of checkpoint files and embedding vectors, described below in the "Training checkpoint output" section.
This script evaluates your trained model by providing mAP and MR1 metrics and an optional t-SNE clustering plot (compare Fig. 3 in the CoverHunter paper).
1. Have a pre-trained CoverHunter model's output checkpoint files available. You only need your best set (typically your highest-numbered one). If you use original CoverHunter's pre-trained model from https://drive.google.com/file/d/1rDZ9CDInpxQUvXRLv87mr-hfDfnV7Y-j/view, unzip it and move it to a folder that you specify in step 3 below.
2. Run your query data through extract_csi_features.py. In the hparams.yaml file for the feature extraction, turn off all augmentation. See data/covers80_testset/hparams.yaml for an example configuration to treat covers80 as the query data:
   python3 -m tools.extract_csi_features data/covers80_testset
   The important output from that is full.txt and the cqt_feat subfolder's contents.
3. Run the evaluation script. This example assumes you are using the trained model you created in egs/covers80 and you want to use all the optional features I added in this fork:
python3 -m tools.eval_testset egs/covers80 data/covers80_testset/full.txt data/covers80_testset/full.txt -plot_name="egs/covers80/tSNE.png" -dist_name='distmatrix' -test_only_labels='data/covers80/dev-only-song-ids.txt'
CoverHunter only shared an evaluation example for the case when query and reference data are identical, presumably to do a self-similarity evaluation of the model. But there is an optional 4th parameter for query_in_ref_path that would be relevant if query and reference are not identical. See the "query_in_ref" heading below under "Input and Output Files."
The optional plot_name argument is a path or just a filename where you want to save the t-SNE plot output. If you provide just a filename, model_dir will be used as the path. See example plot below. Note that your query and reference files must be identical to generate a t-SNE plot (to do a self-similarity evaluation).
The optional test_only_labels argument is a path to the text file generated by extract_csi_features.py if its hyperparameters asked for some song_ids to be reserved exclusively for the test aka "dev" dataset. The t-SNE plot will then mark those for you to see how well your model can cluster classes (song_ids) it has never seen before.
This figure shows the results of training from scratch on the covers80 dataset with a train/val/test split of 8:1:1 and 3 classes (song_ids) reserved exclusively for the test dataset.
The optional dist_name argument is a path where you want to save the distance matrix and ref labels so that you can study the results separately, for example to make custom t-SNE plots.
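Once you have a distance matrix and its labels in memory, metrics like mAP and MR1 can be recomputed offline. The sketch below is a plain-Python illustration for the self-similarity case (query and reference identical). It assumes a square matrix in which smaller values mean more similar, and one song_id label per row. It is not the project's own metric code, which may treat ties or queries without any match differently, and the function name map_and_mr1 is hypothetical.

```python
def map_and_mr1(dist: list[list[float]], labels: list[int]) -> tuple[float, float]:
    """Return (mAP, MR1) for a self-similarity distance matrix."""
    average_precisions = []
    first_hit_ranks = []
    for q, row in enumerate(dist):
        # Rank every other item by ascending distance, skipping the query itself.
        candidates = sorted(
            (i for i in range(len(row)) if i != q), key=lambda i: row[i]
        )
        hits = 0
        precision_sum = 0.0
        first_hit = None
        for rank, i in enumerate(candidates, start=1):
            if labels[i] == labels[q]:
                hits += 1
                precision_sum += hits / rank  # precision at this hit
                if first_hit is None:
                    first_hit = rank
        if hits == 0:
            continue  # a query with no other version cannot be scored
        average_precisions.append(precision_sum / hits)
        first_hit_ranks.append(first_hit)
    if not average_precisions:
        raise ValueError("no query has a matching reference")
    mean_ap = sum(average_precisions) / len(average_precisions)
    mr1 = sum(first_hit_ranks) / len(first_hit_ranks)
    return mean_ap, mr1
```

mAP rewards placing all versions of a song near the top of the ranking, while MR1 only looks at the rank of the first correct match, so the two can move independently.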
See the "Training checkpoint output" section below for a description of the embeddings saved by the eval_for_map_with_feat() function called in this script. They are saved in a new subfolder of the pretrained_model folder named embed_NN_tmp where NN is the highest-numbered epoch subfolder in the pretrained_model folder.
After you have trained a model and run the evaluation script, you can use the model to identify any music you give it. Provide the music input to the tools.identify.py script by creating a one-line text file that has the metadata about the music, following the format of the text files generated by tools.extract_csi_features.py. For example, you could select any of the entries in the data/covers80/full.txt file, like a speed-augmented version of one of the 80 songs.
Example for covers80:
python -m tools.identify egs/covers80 target.txt -top=10
To interpret the output, use the data/covers80/song_id.map text file to see which song_id goes with which song title. Good news: even the bare-bones demo of training from scratch on covers80 shows that CoverHunter does a very good job of identifying versions of these 80 pop songs.
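Looking up titles by hand gets tedious, so here is a small sketch that loads song_id.map into a dictionary. It assumes the layout described for song_id.map under "Input and Output Files" (song string, a space, then the song_id on each line). The function name load_song_id_map is hypothetical.

```python
from pathlib import Path


def load_song_id_map(path: Path) -> dict[int, str]:
    """Read song_id.map and return a {song_id: song title} lookup."""
    id_to_song = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last space so a title containing spaces stays intact.
        song, song_id = line.rsplit(" ", 1)
        id_to_song[int(song_id)] = song
    return id_to_song
```

For example, load_song_id_map(Path("data/covers80/song_id.map")) returns a dictionary you can index with the song_id values shown in the identify output.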
CoverHunter did not include an implementation of the coarse-to-fine alignment training described in the research paper. (Liu Feng confirmed to me that his employer considers it proprietary technology). See issue #1. But it did include this script which apparently could be useful as part of an implementation we could build ourselves. The command to launch the alignment script that CoverHunter included is:
python3 -m tools.alignment_for_frame pretrained_model data/covers80/full.txt data/covers80/alignment.txt
Arguments to pass to the script:
- Folder containing a pretrained model. For example, if you use original CoverHunter's model from https://drive.google.com/file/d/1rDZ9CDInpxQUvXRLv87mr-hfDfnV7Y-j/view, unzip it and move it to a folder that you rename to pretrained_model at the top level of your project folder. That folder in turn must contain a pt_model subfolder that contains the do_000[epoch] and g_000[epoch] checkpoint files.
- The output file from the feature-extraction script described above. It must include song_id key-values for each rec (unlike the raw dataset.txt file that CoverHunter provided for covers80).
- The alignment.txt file will receive the output of this script.
There are two different hparams.yaml files, each used at different stages.
- The one located in the folder you provide on the command line to tools.extract_csi_features.py is used only by that script.
key | value |
---|---|
add_noise | Original CoverHunter provided the example of: { prob : 0.75, sr : 16000, chunk : 3, name : "cqt_with_asr_noise", noise_path : "dataset/asr_as_noise/dataset.txt" }. However, the CoverHunter repo did not include whatever was supposed to be in the "dataset/asr_as_noise/dataset.txt" file, nor does the CoverHunter research paper describe it. If that path does not exist in your project folder structure, then tools.extract_csi_features will just skip the stage of adding noise augmentation. At least for training successfully on Covers80, noise augmentation doesn't seem to be needed. |
aug_speed_mode | list of ratios used in tools.extract_csi_features for speed augmentation of your raw training data. Example: [0.8, 0.9, 1.0, 1.1, 1.2] means use 80%, 90%, 100%, 110%, and 120% speed variants of your original audio data. |
train-sample_data_split | percent of training data to reserve for validation aka "train-sample" expressed as a fraction of 1. Example for 10%: 0.1 |
train-sample_unseen | percent of song_ids from training data to reserve exclusively for validation aka "train-sample" expressed as a fraction of 1. Example for 2%: 0.02 |
test_data_split | percent of training data to reserve for test aka "dev" expressed as a fraction of 1. Example for 10%: 0.1 |
test_data_unseen | percent of song_ids from training data to reserve exclusively for test aka "dev" expressed as a fraction of 1. Example for 2%: 0.02 |
- The one located in the "config" subfolder of the path you provide on the command line to tools.train uses all the other parameters listed below during training.
key | value |
---|---|
covers80 | Test dataset for model evaluation purposes. "covers80" is the only example provided with the original CoverHunter. Subparameters: query_path : "data/covers80/full.txt"; ref_path : "data/covers80/full.txt"; every_n_epoch_to_dev : 1 # validate after every n epoch. These can apparently be the same path as train_path for doing self-similarity evaluation. |
dev_path | Compare train_path and train_sample_path . This dataset is used in each epoch to run the same validation calculation as with the train_sample_path . But these results are used for the early_stopping_patience calculation. Presumably one should include both classes and samples that were excluded from both train_path and train_sample_path . |
query_path | TBD: can apparently be the same path as train_path . Presumably for use during model evaluation and inference. |
ref_path | TBD: can apparently be the same path as train_path . Presumably for use during model evaluation and inference. |
train_path | path to a JSON file containing metadata about the data to be used for model training (See full.txt below for details) |
train_sample_path | path to a JSON file containing metadata about the data to be used for model validation. Compare dev_path above. Presumably one should include a balanced distribution of samples that are not included in the train_path dataset, but do include samples for the classes represented in the train_path dataset. (See full.txt below for details) |
key | value |
---|---|
batch_size | Usual "batch size" meaning in the field of machine learning. An important parameter to experiment with. |
chunk_frame | list of numbers used with mean_size . CoverHunter's covers80 config used [1125, 900, 675]. "chunk" references in this training script might be the chunks described in the time-domain pooling strategy part of their paper, not the chunks discussed in their coarse-to-fine alignment strategy. Or maybe chunk_frame actually refers only to the smaller "frames" described in the time-domain pooling strategy part of their paper. See chunk_s. |
chunk_s | duration of a chunk_frame in seconds. Apparently you are supposed to manually calculate chunk_s = chunk_frame / frames-per-second * mean_size . I'm not sure why the script doesn't just calculate this itself using CQT hop-size to get frames-per-second? |
cqt: hop_size: | Fine-grained time resolution, measured as duration in seconds of each CQT spectrogram slice of the audio data. CoverHunter's covers80 setting is 0.04 with a comment "1s has 25 frames". 25 frames per second is hard-coded as an assumption into CoverHunter in various places. |
data_type | "cqt" (default) or "raw" or "mel". Unknown whether CoverHunter actually implemented anything but CQT-based training |
device | 'mps' or 'cuda', corresponding to your GPU hardware and PyTorch library support. Theoretically 'cpu' could work but untested and probably of no value. |
early_stopping_patience | how many epochs to wait for validation loss to improve before early stopping |
mean_size | See chunk_s above. An integer used to multiply chunk lengths to define the length of the feature chunks used in many stages of the training process. |
mode | "random" (default) or "defined". Changes behavior when loading training data in chunks in AudioFeatDataset. "random" described in CoverHunter code as "cut chunk from feat from random start". "defined" described as "cut feat with 'start/chunk_len' info from line" |
m_per_class | From CoverHunter code comments: "m_per_class must divide batch_size without any remainder" and: "At every iteration, this will return m samples per class. For example, if dataloader's batch-size is 100, and m = 5, then 20 classes with 5 samples iter will be returned." |
spec_augmentation | spectral(?) augmentation settings, used to generate temporary data augmentation on the fly during training. CoverHunter settings were: random_erase : { prob : 0.5, erase_num : 4 }; roll_pitch : { prob : 0.5, shift_num : 12 } |
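
The arithmetic described for chunk_s and m_per_class in the table above can be checked with a few lines of Python. This is only a sketch of the stated formulas. The helper names are hypothetical, and the mean_size value of 3 is an assumption for illustration, so check your own hparams.yaml.

```python
def frames_per_second(hop_size: float) -> int:
    """CQT frames per second for a hop size given in seconds (0.04 -> 25)."""
    return round(1.0 / hop_size)


def chunk_seconds(chunk_frame: int, hop_size: float, mean_size: int) -> float:
    """chunk_s = chunk_frame / frames-per-second * mean_size."""
    return chunk_frame / frames_per_second(hop_size) * mean_size


def classes_per_batch(batch_size: int, m_per_class: int) -> int:
    """Number of distinct classes in one batch of m samples per class."""
    if batch_size % m_per_class != 0:
        raise ValueError("m_per_class must divide batch_size without any remainder")
    return batch_size // m_per_class


if __name__ == "__main__":
    MEAN_SIZE = 3  # assumed value for illustration only
    for chunk_frame in [1125, 900, 675]:
        seconds = chunk_seconds(chunk_frame, hop_size=0.04, mean_size=MEAN_SIZE)
        print(chunk_frame, seconds)
    print(classes_per_batch(batch_size=100, m_per_class=5))
```

The last line reproduces the example from the CoverHunter code comment: a batch of 100 with m = 5 yields 20 classes.
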
key | value |
---|---|
embed_dim | 128 |
encoder | # model-encode. Subparameters: attention_dim : 256 # "the hidden units number of position-wise feed-forward"; output_dims : 128; num_blocks : 6 # number of decoder blocks |
input_dim | 96 |
A JSON formatted or tab-delimited key:value text file (see format defined in the utils.py::line_to_dict() function) expected by extract_csi_features.py that describes the training audio data, with one line per audio file.
key | value |
---|---|
rec | Unique identifier. Abbreviation for "recording." CoverHunter originally used "utt" throughout, borrowing the term "utterance" from speech-recognition ML work which is where much of their code was adapted from. Example "cover80_00000000_0_0". In a musical context we could call this a "performance" rather than "utterance," but "recording" is objectively more accurate in this context, and envisions that this project might find use outside of musicology, too. |
wav | relative path to the raw audio file. Example: "data/covers80/wav_16k/annie_lennox+Medusa+03-A_Whiter_Shade_Of_Pale.wav" |
dur_s | duration of the audio file in seconds. Example 316.728 |
song | title of the song. Example "A_Whiter_Shade_Of_Pale" The _add_song_id() function in extract_csi_features assumes that this string is a unique identifier for the parent cover song (so it can't handle musically distinct songs that happen to have the same title). |
version | Not used by CoverHunter. Example "annie_lennox+Medusa+03-A_Whiter_Shade_Of_Pale.mp3" |
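
If you are cataloguing your own audio, one line of dataset.txt in the JSON variant could be produced like this. It is a sketch only: the helper name dataset_line is hypothetical, the values are copied from the examples in the table above (whether they all describe the same recording is an assumption), and the authoritative parser remains utils.py::line_to_dict().

```python
import json


def dataset_line(rec: str, wav: str, dur_s: float, song: str, version: str) -> str:
    """Serialize one recording's metadata as a single JSON line."""
    entry = {"rec": rec, "wav": wav, "dur_s": dur_s, "song": song, "version": version}
    return json.dumps(entry, ensure_ascii=False)


line = dataset_line(
    rec="cover80_00000000_0_0",
    wav="data/covers80/wav_16k/annie_lennox+Medusa+03-A_Whiter_Shade_Of_Pale.wav",
    dur_s=316.728,
    song="A_Whiter_Shade_Of_Pale",
    version="annie_lennox+Medusa+03-A_Whiter_Shade_Of_Pale.mp3",
)
print(line)
```

Writing one such line per audio file, each terminated by a newline, gives the one-line-per-file layout described above.
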
full.txt is the JSON-formatted training data catalog for tools.train.py, generated by tools.extract_csi_features. In case you do your own data prep instead of using tools.extract_csi_features, here's the structure of full.txt.
key | value |
---|---|
rec | (see dataset.txt) |
wav | (see dataset.txt) |
dur_s | (see dataset.txt) |
song | (see dataset.txt) |
version | (see dataset.txt) |
feat | path to the CQT features of this rec stored as .npy array. Example: "data/covers80/cqt_feat/sp_0.8-cover80_00000146_71_0.cqt.npy" |
feat_len | output of len(np.load(feat)). Example: 9198 |
song_id | internal, arbitrary unique identifier for the song. This is what teaches the model which recs (recordings) are considered by humans to be the "same song." Example: 0 |
version_id | internal, arbitrary unique identifier for each artificially augmented variant of the original rec (performance). Example: 0 |
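
A quick way to sanity-check a full.txt file is to count how many entries each song_id has. The sketch below assumes the JSON-per-line layout described above. The function name versions_per_song is hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def versions_per_song(full_txt: Path) -> Counter:
    """Count entries (recordings plus augmented variants) per song_id."""
    counts = Counter()
    for line in full_txt.read_text(encoding="utf-8").splitlines():
        if line.strip():
            counts[json.loads(line)["song_id"]] += 1
    return counts
```

For example, versions_per_song(Path("data/covers80/full.txt")).most_common(5) lists the five song_ids with the most entries.
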
Note: Original CoverHunter omitted the unmodified audio by accident due to a logic error at lines 104-112 of tools.extract_csi_features, which unintentionally appended the next value of sp_rec to the beginning of local_data['rec']. And if and only if the '1.0' member of the aug_speed_mode hyperparameter was not listed first, the result was not only that the 1.0 variant was omitted, but also that a duplicate copy of the 90% variant was created and included in the final output of full.txt, both entries in full.txt pointing to the same cqt.npy file, just with different version_id values.
That bug didn't prevent successful training, but fixing the bug did, until I discovered why: once the model was being fed the intended number of song versions (augmented from 1 to 5 instead of to 4 versions), CoverHunter's preset batch size of 16 became a barrier to success. Increasing the batch size hyperparameter to 32 and larger made a huge difference, resulting in much faster convergence and higher mAP than the original CoverHunter code.
Text file crosswalk between "song" (unique identifying string per song) and the "song_id" number arbitrarily assigned to each "song" by the extract_csi_features.py script. Not used by any scripts in this project currently, but definitely useful as a reference for human interpretation of training results.
filename | comments |
---|---|
cqt_feat subfolder | Numpy array files of the CQT data for each file listed in full.txt. Needed by train.py |
data.init.txt | Copy of dataset.txt after sorting by rec and de-duping. Not used by train.py |
dev.txt | A subset of full.txt generated by the _split_data_by_song_id() function intended for use by train.py as the dev dataset. |
dev-only-song-ids.txt | Text file listing one song_id per line for each song_id that the train/val/test splitting function held out from train/val to be used exclusively in the test aka "dev" dataset. This file can be used by eval_testset.py to mark those samples in the t-SNE plot. |
full.txt | See above detailed description. Contains the entire dataset you provided in the input file. |
song_id.map | Text file, with 2 columns per line, separated by a space, sorted alphabetically by the first column. First column is a distinct "song" string taken from dataset.txt. Second column is the song_id value assigned to that "song." |
sp_aug subfolder | Sox-modified wav speed variants of the raw training .wav files, at the speeds defined in hparams.yaml. Not used by train.py |
sp_aug.txt | Copy of data.init.txt but with addition of 1 new row for each augmented variant created in sp_aug/*.wav. Not used by train.py. |
train.txt | A subset of full.txt generated by the _split_data_by_song_id() function intended for use by train.py as the train dataset. |
train-sample.txt | A subset of full.txt generated by the _split_data_by_song_id() function intended for use by train.py as the train-sample dataset. |
Original CoverHunter also generated the following files, but they were not used by their published codebase, so I commented out those functions:
filename | comments |
---|---|
song_id_num.map | Text file, not used by train.py, maybe not by anything else? |
song_name_num.map | Text file, not used by train.py, maybe not by anything else? |
Using the default configuration, training saves checkpoints after each epoch in the egs/covers80 folder.
The pt_model subfolder gets two files per epoch: do_000000NN and g_000000NN where NN=epoch number. The do_ files contain the AdamW optimizer state. The g_ files contain the model's state dictionary. "g" might be an abbreviation for "generator" given that a transformer architecture is involved?
The eval_for_map_with_feat() function, called at the end of each epoch, also saves data in a separate new subfolder for each epoch, named epoch_NN_covers80. This in turn gets a query_embed subfolder containing the model-generated embeddings for every sample in the training data, plus the embeddings for time-chunked sections of those samples, named with a suffix of ...__start-N.npy where N is the timecode in seconds of where the chunk starts. The saved embeddings are 1-dimensional arrays containing 128 double-precision (float64) values between -1 and 1. The epoch_NN_covers80 folder also gets an accompanying file query.txt (with an identical copy as ref.txt) which is a text file listing the attributes of every training sample represented in the query_embed subfolder, following the same format as described above for full.txt.
The file you can prepare for the tools/eval_testset.py script to pass as the 4th parameter query_in_ref_path (CoverHunter did not provide an example file or documentation) assumes:
- JSON or tab-delimited key:value format
- The only line contains a single key "query_in_ref" with a value that is itself a list of tuples, where each tuple represents a mapping between an index in the query input file and an index in the reference input file.
This mapping is only used by the _generate_dist_matrix() function. That function explains: "List[(idx, idy), ...], means query[idx] is in ref[idy] so we skip that when computing mAP."
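Since no example file exists, here is one way such a file might be generated in the JSON variant. It is a sketch under stated assumptions: the helper name write_query_in_ref is hypothetical, and because JSON has no tuple type each pair is written as a 2-item list. Whether eval_testset accepts lists in place of tuples has not been verified here.

```python
import json
from pathlib import Path


def write_query_in_ref(pairs: list[tuple[int, int]], path: Path) -> None:
    """Write a one-line file mapping query indices to reference indices."""
    # Each (idx, idy) pair means query[idx] is the same item as ref[idy].
    payload = {"query_in_ref": [list(pair) for pair in pairs]}
    path.write_text(json.dumps(payload) + "\n", encoding="utf-8")
```

For instance, write_query_in_ref([(0, 3), (1, 7)], Path("query_in_ref.txt")) records that query 0 also appears as reference 3 and query 1 as reference 7.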
Hand-made visualization of how core functions of this project interact with each other. Also includes additional beginner-friendly or verbose code-commenting that I didn't add to the project code.