TL;DR: I am working on a large-scale machine learning project to distinguish music from random noise using MIDI. I need to separate percussion tracks from non-percussion tracks.
I have a massive MIDI database of ~130,000 songs - downloadable here if you want to follow along.
Each of these MIDI files contains an arbitrary number of tracks. Some of these tracks are instrumental (non-percussion); others are percussion. For my purposes, I want to take the tracks in a given song and write all NON-PERCUSSION tracks to a single instrumental output track.
I have no idea how to differentiate, in an automated manner, between instrumental tracks (those which sound a distinguishable pitch, like piano or trumpet, but also vibraphone and the like) and percussion tracks (those which sound no distinguishable pitch, like snare or bass drum).
Going through the tracks manually is not an option due to sheer volume. Furthermore, the track titles are unreliable (sample titles: 'Violin - Solo', but also stuff like 'track 17' and even glitchy stuff like '* Music Energy II GM Data, Music Channel BBS'). Due to the way my MIDI library works, some of these tracks do not contain any notes at all. All of this implies that whatever I implement will have to classify tracks based on the actual notes in each track, not on its title.
Clearly Sibelius can perform this differentiation, because whenever I open a MIDI file with it, the percussion tracks are correctly mapped to percussion sounds instead of default instruments.
What am I missing here?
I am using the Python MIDI library Mido to traverse the database and read the MIDI files, and the Python MIDI library MIDIUtil to write MIDI files.
I can post my code in the comments if requested.