The widespread use of electronic musical instruments has driven the need for a digital representation of musical information. The MIDI (Musical Instrument Digital Interface) standard has evolved over the years into the common interchange format for modern electronic instruments, storing instrument patch settings as well as performance details and metadata. We used two deep learning models (an LSTM and a CNN) to determine whether the music of four composers (Bach, Beethoven, Mozart and Chopin) could be successfully classified using only information contained in MIDI files from a Kaggle dataset. Such a task requires careful feature selection: identifying which musical components distinguish a composer's characteristic style. Visualization of the differences based on these features helped to confirm the utility of the chosen parameters. The pretty_midi library was used to extract the selected features, and we piloted our efforts using a Support Vector Machine, which had been used with success previously, achieving 90% accuracy with the radial basis function kernel. We then trained an LSTM model and a CNN model, achieving 85% accuracy with the LSTM and 91% accuracy with the CNN. The implications of our findings are discussed.
For centuries, musical information has been recorded using a standardized notation, with notes written on the lines of a staff and special instructions alongside to indicate dynamics (loudness and softness), accents, tempo and other directions. This has long sufficed for human performers on traditional musical instruments. With the development of electronic musical instruments, however, orthodox musical notation became inadequate to capture the number of parameters needed to specify all the nuances of performance these new instruments could deliver. Synthesizer configuration, for example, is best expressed as digital data, much as document formats were developed for word processors.
Rather than storing digital representations of the audio spectrum, a communications protocol was developed that provided a digital interface through which controllers could send signals to a compatible musical instrument, directing it to play notes with a given pitch, velocity, duration and other characteristics. This is the Musical Instrument Digital Interface (MIDI) protocol, developed in the early 1980s, primarily to store settings from a burgeoning array of complex digital electronic instruments that no longer used patch cords to modulate and channel sound creation. It was envisioned that these instruments would read and respond to digital information, and that these instructions could be stored in a convenient file format. The information could then be used to instantly configure a musical instrument as the musician intended. Indeed, a key motivation for the development of MIDI was to enable live performance on digital electronic instruments, such as synthesizers. The protocol has undergone revisions and has adapted to the demands of newer digital instruments capable of touch sensitivity, polyphony and greater musical expression, such as sostenuto, aftertouch, pitch bend and many other features.
MIDI files are structured as “chunks” with the basic chunks being the Header chunk and the Track chunk. There are three file formats:
Format 0 has a header chunk followed by one track chunk.
Format 1 allows for one or more tracks of a MIDI sequence.
Format 2 indicates one or more sequentially independent single-track patterns.
The header chunk starts with four bytes, MThd (4D 54 68 64), which denote the beginning of the file. This is followed by a four-byte length field (00 00 00 06), then two bytes denoting the MIDI file format discussed above, two bytes indicating the number of track chunks in the file, and two bytes denoting the number of ticks per quarter-note.
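As an illustration, the header chunk can be parsed directly with Python's standard library; the following is a minimal sketch (the helper and file name are our own, for illustration only):

import struct

def read_midi_header(path):
    # Read the 14-byte header chunk: "MThd", a 4-byte length field (always 6),
    # then format, number of tracks, and ticks per quarter-note, 2 bytes each.
    with open(path, "rb") as f:
        assert f.read(4) == b"MThd", "not a Standard MIDI File"
        (length,) = struct.unpack(">I", f.read(4))
        fmt, ntracks, division = struct.unpack(">HHH", f.read(6))
    return fmt, ntracks, division

print(read_midi_header("example.mid"))  # hypothetical file name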
The track chunk is much more complex, and is where the actual data is stored. Each track chunk consists of an initial four-byte header, MTrk (4D 54 72 6B), followed by MIDI events, sysex events and meta events. Each of these contains a wealth of information.
MIDI events contain detailed instructions on when a note is turned on or off, key pressure, control changes, program changes (the instrument to be played), channel pressure, channel mode messages, and so on. These define the basic characteristics of the keyboard interaction.
Sysex events carry system-exclusive information, such as song selection, timing information, tuning requests, resets, etc.
Meta events contain meta-information about the music, such as copyright information, lyrics, time signature, instrument name, track name, cue information for staging events, etc.
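To make these event categories concrete, the sketch below walks the events of each track. It uses the mido library purely for illustration (pretty_midi, used later in this paper, abstracts these details away), and assumes a hypothetical file name:

import mido

mid = mido.MidiFile("example.mid")  # hypothetical file
for i, track in enumerate(mid.tracks):
    for msg in track:
        if msg.is_meta:                    # meta events: track name, tempo, time signature, lyrics, ...
            print(i, "meta", msg.type)
        elif msg.type == "sysex":          # system-exclusive events
            print(i, "sysex", len(msg.data), "bytes")
        else:                              # channel (MIDI) events: note_on, note_off, program_change, ...
            print(i, msg.type)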
The task of classifying each MIDI file to the composer of the musical piece it describes comes down to extracting, from the MIDI events, information corresponding to the features a human would use to identify a composer. These include complexity in the melodic line, harmonies, chord progressions, polyrhythm, variations in tempo, the use of counterpoint, ostinato and pedal point, the use of certain stylistic cadences, the selection of instruments, changes in time and key signatures, and overall structure (sonata form, fugue, dance, etc.).
We will use selected features from the MIDI files in the Kaggle dataset to classify amongst four composers: Bach, Beethoven, Mozart and Chopin. Although their musical styles are distinctive prima facie to a human listener, there are some stylistic similarities amongst these four that might prove to be a challenge to machine learning algorithms.
Bach's music belongs to the Baroque era. It is notable for heavy use of counterpoint, with multiple independent melodic lines playing together. Harmonic progressions contained passing dissonances, creating tension and anticipation. Melodic lines were often dense and would carry across various voices. Bach made liberal use of Baroque ornamentation, such as mordents, trills and appoggiaturas. Variations in tempo were not as pronounced as with other composers. Most of his music was composed for chamber groups, and even the Brandenburg Concertos were intended for a smaller orchestral ensemble. Bach favored the Phrygian and plagal cadences.
Beethoven's early music belongs to the Classical period, but his later period overlapped with the early Romantic era. His music is notable for dramatic contrasts in mood, with serene lyrical moments often followed by passionate and intense passages. He tended to follow sonata form in many of his compositions. His symphonic writing was rich and lush, and his orchestration demanded a large ensemble. He often composed programmatic music, as in his symphonies. His piano sonatas often used ostinato and did not employ counterpoint to the extent Bach did. His use of ornamentation was sparse.
Mozart also composed during the Classical period; however, his music emphasized beauty, with elegant, singing melodic lines, and his works also include compositions of an intensely spiritual nature, such as the Requiem. He likewise followed sonata form, with emphasis placed on balance and symmetry. Mozart composed for piano, chamber and orchestral ensembles, and he frequently used trill cadences, as did other Classical composers.
Chopin composed during the Romantic period, and his compositions were largely for the piano. As such, his music emphasized virtuosity and presented challenges to the performer, such as polyrhythms, wide-ranging melodic lines, and complexity in harmonic voicing and texture. Ornamentation was largely in the form of technically difficult fiorituras. Dramatic changes in tempo are common, and sections are clearly delineated, with the melodic statements and harmonies of one section contrasting with those of another. In contrast to the other three composers, performers often use rubato to increase the emotional intensity of the lyrical melodies.
The challenge in classifying MIDI files to exploit these stylistic differences requires being able to extract the necessary information from the MIDI events section of the Track chunk.
The dataset was obtained from Kaggle (Kaggle, 2019). The dataset consisted of the works of 175 composers in 3929 MIDI files. The musical selections encompassed piano, chamber and orchestral works.
We first checked for files with missing or duplicated data, and did not identify any such files. From the dataset, we selected the folders of the four composers listed above and discarded the remainder. The number of works by the four composers was unequal, with the largest number of compositions by Bach (1024 files), followed by Mozart (255 files), Beethoven (212 files) and Chopin (136 files).
To extract the features from the MIDI files, we used the pretty_midi Python library designed to read and manipulate MIDI files (craffel, 2023a). This library can also be used to read and extract a wealth of features from the MIDI files (craffel, 2023b). These include detection of parameters related to tempo, key signature, instrument selection and changes, note duration, note velocity, pitch-transition and pitch-bend, sustain pedal dynamics, as well as numerous other parameters pertinent to performance-related needs and specialty instruments.
To access these features, one first creates a PrettyMIDI object, as follows:
midi_data = pretty_midi.PrettyMIDI(midi_file)
From this object, one can access midi_data.instruments, which contains information about note durations and pitch classes. Key signatures and their changes can be accessed with midi_data.key_signature_changes.
Another particularly helpful method is get_piano_roll(), which returns a matrix from which note velocity, pitch variance and transitions, rhythmic density, and chord density can be derived.
piano_roll = midi_data.get_piano_roll()
With these functions, we collected information about the following parameters:
● Note density
● Pitch variance
● Velocity statistics (mean, max, variance)
● Polyphony
● Rhythmic density
● Average pitch interval
● Chord density
● Pitch transitions
● Average note duration (for sustained notes)
We investigated whether these parameters would allow for sufficient discrimination between the different composer styles, such that a deep learning model could be trained to classify based on these features.
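The following sketch illustrates how several of these parameters can be computed with pretty_midi; the exact definitions (for example, the piano-roll sampling rate and the polyphony measure) are our own assumptions rather than the precise formulas of the final pipeline:

import numpy as np
import pretty_midi

def extract_features(path):
    pm = pretty_midi.PrettyMIDI(path)
    notes = [n for inst in pm.instruments if not inst.is_drum for n in inst.notes]
    duration = pm.get_end_time()
    roll = pm.get_piano_roll(fs=10)                 # 128 x time matrix of note velocities
    pitches = np.array([n.pitch for n in notes])
    velocities = np.array([n.velocity for n in notes])
    active = (roll > 0).sum(axis=0)                 # simultaneous notes per time frame
    return {
        "num_instruments": len(pm.instruments),
        "note_density": len(notes) / duration,      # notes per second
        "pitch_variance": pitches.var(),
        "velocity_mean": velocities.mean(),
        "velocity_max": velocities.max(),
        "velocity_variance": velocities.var(),
        "polyphony": active.mean(),                 # average number of simultaneous notes
        "avg_pitch_interval": np.abs(np.diff(pitches)).mean(),
        "avg_note_duration": np.mean([n.end - n.start for n in notes]),
    }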
On the first run of the feature extraction function, we identified 18 files that were poorly formed: they lacked the proper headers, raised list-index-out-of-range errors, or contained invalid key signature information (see Figure 1).
Figure 1: Errors generated on MIDI file processing on the raw Kaggle dataset.
These errant files were manually deleted, and the analysis was performed on the remaining MIDI files without error.
Train-test splitting was performed using scikit-learn's train_test_split() function. As one can see in Figure 2, the range of feature values is large, and standard scaling was therefore applied to normalize the data before training.
Figure 2: Scatter plot of values in X_train and X_test.
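A minimal sketch of the splitting and scaling steps, assuming the extracted features are in an array X with composer labels y (the split ratio and stratification are our assumptions):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hold out a test set, stratified by composer.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)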
As seen above, the number of MIDI files associated with each composer varied greatly, and this is especially apparent in the training set, where there was an almost 8-fold difference between Bach and Chopin. An imbalanced dataset introduces bias into the model, as it learns to preferentially classify MIDI samples as Bach in order to statistically minimize classification loss. To deal with this, we used SMOTE to equalize the training data (Figure 3).
Figure 3: The effect of SMOTE on addressing data imbalance in the training set.
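A minimal sketch of this step using imbalanced-learn, applied to the training set only:

from imblearn.over_sampling import SMOTE

# Oversample the minority composers so all classes match the size of the largest;
# the test set keeps its original class proportions.
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)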
Visualizations
Prior to deep learning classification, visual analysis of the extracted features can reveal differences amongst the four composers.
Looking at tempo distribution, one can see that three of the four composers tended toward tempos of around 100 to 150 bpm (moderato to allegro). Bach notably composed many pieces to be played much slower (adagio, andante) or much faster (vivace, presto) than this norm (Figure 4).
Figure 4. Tempo distribution by composer
This is also seen in the plot of the average note velocity distribution of the various compositions (Figure 5). The violin plots show that the works of Chopin and Beethoven were similar in tempo and note velocity. Although the tempo of Bach's works was not necessarily rapid, he did incorporate more high-velocity notes within his compositions.
Figure 5. Average note velocity distribution.
Another way to distinguish among the composers is to examine the number of instruments vs average velocity (Figure 6).
Figure 6: Number of instruments in each composition plotted against average note velocity.
For Chopin, who composed primarily for the piano, most data points lie along the left of the graph. Mozart, Bach and Beethoven composed works for the piano as well as for chamber and orchestral ensembles.
A first pass at delving into harmony and chromaticism in the composers' musical styles is to examine the pitch classes utilized. In Figure 7, we plotted the pitch class usage for each composer, normalized to the number of compositions. Pitch class usage reflects the key signatures used in the compositions, but it also takes accidentals into account, so by itself it does not necessarily reflect how “adventurous” a composer was in straying from the diatonic path.
Figure 7: Normalized pitch class usage by composer.
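For reference, pretty_midi can produce these histograms directly via get_pitch_class_histogram(); the averaging over a composer's files shown below is our own convention for normalizing by the number of compositions:

import numpy as np
import pretty_midi

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class_usage(paths):
    # Average the normalized 12-bin pitch class histograms over a composer's files.
    hists = [pretty_midi.PrettyMIDI(p).get_pitch_class_histogram(normalize=True)
             for p in paths]
    return dict(zip(PITCH_CLASSES, np.mean(hists, axis=0)))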
Another way of visualizing the variation in pitch class usage by each composer is a radar plot, as in Figure 8. In this plot, it is clear that a composer like Chopin used a more even distribution of pitch classes, whether through key signature selection or a more liberal use of accidentals. Other composers tended to favor certain pitch classes over others.
Figure 8: Radar plot of normalized pitch class usage by composer
Finally, we delve into chromaticism in Figure 9, looking at pitch classes at non-diatonic intervals relative to the key signature. This takes accidentals into account, but also includes harmonic expressions and progressions that extend beyond tonic and dominant chords. It is not surprising that Mozart's music favored major chords, and that both Bach and Chopin were more adventurous in the pitch classes they used. Beethoven was intermediate, incorporating more non-diatonic notes of the chromatic scale.
Figure 9: Average proportion of non-diatonic notes by composer and key.
Having seen how pretty_midi can extract features that distinguish the various composers, we now turn our attention to using these features to train deep learning models to classify the MIDI files by composer.
We have seen that the selected features are indeed able to evince differences amongst the musical styles of the four composers. Before embarking on the classification process, we sought to explore which of these features would contribute most to the classification.
One way to determine this is Random Forest modeling, which ranks feature importance by each feature's relative contribution to the reduction in Gini impurity. The advantage of this method is that it is relatively easy to calculate, but it is tied to the Random Forest classification model. We performed this analysis and found that the most important features were num_instruments, velocity variance and polyphony (Figure 10).
Figure 10: Gini importance values in a Random Forest model, to determine the relative feature importance in the classification of composers.
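A sketch of this analysis, assuming the SMOTE-balanced training arrays from above and a list feature_names taken from the extraction step (both names are our own):

from sklearn.ensemble import RandomForestClassifier

# Fit on the scaled, SMOTE-balanced training data and rank features by
# their mean decrease in Gini impurity.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train_res, y_train_res)
for name, score in sorted(zip(feature_names, rf.feature_importances_),
                          key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {score:.3f}")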
A more detailed analysis of each feature's contribution to composer classification is provided by SHAP (SHapley Additive exPlanations). In Figure 11, we see the relative importance of each feature toward the classification decision for each composer's MIDI dataset.
Figure 11: Mean absolute SHAP values for each composer.
Separate beeswarm plots further detail the relative importance of each feature in composer classification (Figure 12). It is clear that num_instruments, velocity_variance and polyphony are among the most important features used in classification.
Figure 12: Ranked importance of the extracted features for the classification of each composer. Velocity variance was most important for the classification of Bach and Beethoven, however the number of instruments in the composition was of greater importance in the classification of the music of Chopin and Mozart.
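The SHAP analysis can be reproduced roughly as follows, reusing the Random Forest fitted above; note that, depending on the shap version, shap_values may be returned as a list of per-class arrays or as a single 3-D array:

import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)   # per-class SHAP values

# Mean-absolute-SHAP bar chart across classes, then a beeswarm for one class.
shap.summary_plot(shap_values, X_test, feature_names=feature_names, plot_type="bar")
shap.summary_plot(shap_values[0], X_test, feature_names=feature_names)  # e.g. the first composer class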
Previous work on classifying MIDI samples into different musical styles found high accuracy using Support Vector Machines, especially with the radial basis function kernel (Gnegy, 2014). Following that approach, we were indeed able to achieve high classification accuracy (0.90) (Figure 13).
Figure 13: Classification report using SVM (rbf).
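A sketch of the SVM pilot (default C and gamma are assumed):

from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Radial-basis-function SVM trained on the balanced training set.
svm = SVC(kernel="rbf")
svm.fit(X_train_res, y_train_res)
print(classification_report(y_test, svm.predict(X_test)))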
This final project required classification using two deep learning algorithms: LSTM and CNN. We therefore trained these two models on the extracted MIDI features.
LSTM models require a three-dimensional input shape, consisting of (batch size, time steps, features per time step). To meet this requirement, the augmented train and test sets were reshaped in preparation for training.
Next, because the target variables are categorical strings, we encoded the y_train and y_test sets to integer values using scikit-learn's LabelEncoder. The encoded outputs were then converted to a binary (one-hot) matrix representing the composer names: each row contains a 1 at the index of the composer to whom the music file belongs and 0s elsewhere. We arbitrarily selected a sequence length of five to create sequences from the training and test data; the model is trained on these sequences.
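A sketch of the encoding and sequencing steps; labelling each window with the composer of its final feature vector is our assumption:

import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

SEQ_LEN = 5  # sequence length chosen above

# Encode composer names as integers, then as one-hot vectors.
encoder = LabelEncoder()
y_train_enc = to_categorical(encoder.fit_transform(y_train_res))
y_test_enc = to_categorical(encoder.transform(y_test))

def make_sequences(X, y, seq_len=SEQ_LEN):
    # Stack seq_len consecutive feature vectors into one sequence,
    # labelled with the composer of the final vector in the window.
    Xs, ys = [], []
    for i in range(len(X) - seq_len + 1):
        Xs.append(X[i:i + seq_len])
        ys.append(y[i + seq_len - 1])
    return np.array(Xs), np.array(ys)

X_train_seq, y_train_seq = make_sequences(np.asarray(X_train_res), y_train_enc)
X_test_seq, y_test_seq = make_sequences(np.asarray(X_test), y_test_enc)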
The LSTM model consists of the following layers (in no particular order):
Because we are performing multiclass classification, we chose the Adam optimizer with an initial learning rate of 0.01 and the categorical cross-entropy loss function. To reduce overfitting and facilitate convergence, we implemented early stopping and a learning rate scheduler, monitoring validation accuracy and loss. Although we set our LSTM model to train for 1000 epochs, early stopping terminated training at 224 epochs, with training and validation accuracies of 0.87 and 0.85, respectively. The training and validation losses were 0.30 and 0.48, respectively.
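Since the exact layer list is not reproduced here, the sketch below uses an assumed architecture; only the optimizer, learning rate, loss, epoch budget and callback behaviour follow the description above:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

model = Sequential([
    LSTM(64, input_shape=(X_train_seq.shape[1], X_train_seq.shape[2])),  # assumed layer sizes
    Dropout(0.3),
    Dense(32, activation="relu"),
    Dense(y_train_seq.shape[1], activation="softmax"),
])
model.compile(optimizer=Adam(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    EarlyStopping(monitor="val_accuracy", patience=20, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=10),
]
history = model.fit(X_train_seq, y_train_seq, epochs=1000,
                    validation_data=(X_test_seq, y_test_seq), callbacks=callbacks)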
Figure 14: LSTM training and validation accuracy (left). LSTM training and validation loss (right).
The training and validation loss graph (Figure 14) shows that our efforts to reduce overfitting have benefited the model, as seen in the convergence of the training and validation loss curves, with the validation loss curve not rising far above the training loss curve.
After making predictions on the X_test set, we generated a classification report.
Figure 15: LSTM classification report
The LSTM model had an accuracy of 0.85 (Figure 15). This accuracy is evident in the confusion matrix (Figure 16), where most of the predictions lie on the diagonal, meaning that most predictions were correct. The small counts in the off-diagonal cells are instances of incorrect predictions, with the y-axis indicating the true label and the x-axis the predicted label. The diagonal also shows that most of the samples in the test set are from Bach, because SMOTE synthetic oversampling is not applied to the test set.
Figure 16: Confusion matrix for the LSTM classification model.
Like the LSTM, the CNN model requires a three-dimensional input shape. We followed the same reshaping steps, but excluded the sequencing step, which is required for LSTMs because of their recurrent nature.
The CNN model consists of the following layers (in no particular order):
These layers were developed empirically, balancing the need to achieve convergence and training accuracy against the need to avoid volatility in validation accuracy and overfitting. Because we are performing multiclass classification, we chose the Adam optimizer with an initial learning rate of 0.001 and the categorical cross-entropy loss function. To reduce overfitting, we implemented early stopping and a learning rate scheduler, both monitoring the validation loss. Again, we set the number of epochs for our CNN model to 100; however, early stopping terminated training at 58 epochs, yielding training and validation accuracies of 0.94 and 0.90, respectively. The training and validation losses were 0.18 and 0.33, respectively.
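As with the LSTM, the layer list is not reproduced here, so the following is an assumed 1-D CNN configuration consistent with the optimizer, loss, epoch budget and callbacks described above:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

# Reshape the (samples, features) matrices to (samples, features, 1) for Conv1D.
X_train_cnn = X_train_res.reshape(X_train_res.shape[0], X_train_res.shape[1], 1)
X_test_cnn = X_test.reshape(X_test.shape[0], X_test.shape[1], 1)

model = Sequential([
    Conv1D(32, kernel_size=3, activation="relu",
           input_shape=(X_train_cnn.shape[1], 1)),     # assumed layer sizes
    MaxPooling1D(pool_size=2),
    Dropout(0.3),
    Flatten(),
    Dense(64, activation="relu"),
    Dense(y_train_enc.shape[1], activation="softmax"),
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
]
history = model.fit(X_train_cnn, y_train_enc, epochs=100,
                    validation_data=(X_test_cnn, y_test_enc), callbacks=callbacks)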
Figure 17: CNN training and validation accuracy (left). CNN training and validation loss (right).
The training and validation loss graph (Figure 17) shows that our efforts to reduce overfitting have benefited the model, as seen in the convergence of the training and validation loss curves, with the validation loss curve sitting only slightly above the training loss curve.
After making predictions on the X_test set, we generated a classification report.
Figure 18: CNN classification report
The CNN model had an accuracy of 0.91 (Figure 18). As with the LSTM model, the majority of predictions in the confusion matrix (Figure 19) sit on the diagonal, indicating that most predictions were correct; visually, there are few incorrect predictions. Again, there is a large number of Bach instances, since synthetic oversampling is intended to be applied only to the training data.
Figure 19: Confusion matrix for the CNN classification model.
We have shown that, with careful selection of musical features, we were able to extract key information stored in MIDI files of composers' musical oeuvres, allowing deep learning networks trained on these characteristic features to classify the composer accurately on a separate validation set. Our models achieved an accuracy of at least 0.85 overall. The main limitation of the dataset was the relative imbalance of representative files amongst the composers, and we endeavored to correct for this using SMOTE data augmentation, to guide the models away from overtraining on composers, such as Bach, with the greatest representation in the dataset. Despite this, the individual f1-scores reflected each composer's dataset size. It is notable, however, that despite Chopin having the fewest compositions in the dataset, both the LSTM and CNN models classified his compositions better than they did those of Beethoven or Mozart.
Some factors that impacted our models' performance were the size of the dataset and the class imbalance. Neural networks perform better with larger datasets, as their large number of trainable parameters allows them to capture more feature detail. The class imbalance in our music dataset introduced a bias favoring Bach, which necessitated data augmentation: SMOTE was used to even out the class imbalance. One downside is that the minority classes contain a larger proportion of synthetic samples than the Bach class. Ideally, we would want the dataset to comprise real data as much as possible, but data augmentation was necessary in this situation to help the neural networks generalize to unseen data.
We attempted to intuit how each deep learning network approached the classification task by examining the characteristics of the features obtained by pretty_midi for each composer. In our EDA, we found that different MIDI features were more salient for different composers, and these differences could potentially explain how the models were trained. For example, with Bach's music, the strongest predictors were the number of instruments, velocity variance and polyphony. For the remaining composers, the most significant predictors were not as clear-cut. We cannot be certain that the networks operated using this kind of analytic breakdown, but it is conceivable to imagine them trained in this fashion. The EDA did, however, give us confidence that we had selected an adequate number of suitable features for training.
It is intriguing that, using the feature set we had selected, the models were less accurate at distinguishing between Chopin and Beethoven, which is also reflected in the analysis of tempo among the composers. Although both composers' music is associated with the Romantic era, a human listener would have no problem distinguishing their musical styles, suggesting that there are musical characteristics registered by the human brain that cannot be captured by dry assessments of features such as tempo or note velocity. The modal character of a piece, the choice of chord voicings, the evocation of nationalistic or cultural melodic motifs, and association with tunes with which we are already familiar all affect our ability to recognize and classify a particular piece. These humanistic aspects of music are ones that MIDI was not designed to capture, nor would it likely be able to.
Besides using more balanced data, we believe there is still room for improvement in the implementation of the neural networks. Many hyperparameters could potentially benefit from fine-tuning, such as learning rate, optimizer type, batch size, and sequence length in the case of LSTMs.
The ability of deep learning networks to capture essential musical features has important implications. Our results suggest many possibilities for the application of MIDI analysis and classification tools. With further training, a larger dataset and fine-tuning, our models could be extended to identify music beyond the classical genre and to other targets, such as emotion, time period, instruments used, and more. In addition, such models could support compositional assistance in generative AI; voice-to-MIDI controllers; enhanced recommender systems; automated mixing and mastering of music; and the design of new instruments and effects for synthesizers in music production.
To summarize, the model results have demonstrated the effectiveness of using deep learning to perform MIDI analysis and classification, which can serve both producers and listeners of music.
Craffel, C. (2023a). Pretty_midi. https://github.com/craffel/pretty-midi
Craffel, C. (2023b). Pretty_midi documentation. https://craffel.github.io/pretty-midi/
Gnegy, C. (2014). Classification of musical playing styles using MIDI information. Stanford University. https://cs229.stanford.edu/proj2014/Chet%20Gnegy,Classification%20Of%20Musical%20Playing%20Styles.pdf
Kaggle. (2019). Midi_classical_music. https://www.kaggle.com/datasets/blanderbuss/midi-classic-music