I’m going to try to use deep learning to mimic what a guitar amp does.
Here’s the result so far:
To learn how I did this (and to try it yourself), read on…
Here are some thoughts/questions that motivated me to try this:
- I love tinkering and figuring things out. If this actually works, I’d love to end up with a plugin based on deep learning that can mimic any amp I can find…and I’d love to share it with the community!
- I can’t help but wonder: “How complex is a guitar amp?” What does it take to get a neural network to learn how to map from an input signal (guitar DI) to the output (“amped” tone)? Can I dissect different parts of the complete end-to-end tone? Are some parts easier to “get right” than others?
- It seems like the number of plugins or digital amp models available to guitarists is exploding right now. People who sell these have an interest in guarding their IP, which leaves guitarists wondering: “What’s really going on inside these modelers?” I’m not going to reverse-engineer anyone’s product, but I do want to find out for myself what works…and I want to share what I find out.
- How “right” (in terms of quantitative error, e.g. RMSE) do I have to be before the difference is indistinguishable to the ear? Do some ways of quantifying error correlate better with my subjective experience as a listener?
- It just sounds like a fun thing to try!
Think of the raw guitar DI signal as an input, and the tone coming out of the amp as the output. The amp is just a “function” that maps the input to the output. I’m going to refer to it as f throughout this post. Simply put, amp modelers are just trying to approximate f as well as possible. I can break approaches to approximating f into two broad camps:
- Physics-based, where one tries to model the individual components of the circuit accurately and have the sum total effect arise from there. This is what modelers like the Axe FX and Helix try to do.
- Black-box/data-driven, where we try to figure out the function from data, i.e., hearing someone play using an example of the amp we want to model. In order to do this, it will help to use some “common sense” to guide our model in the right direction, but hopefully the model can figure out all of the important details after that. The Kemper seems to follow this, though I don’t know what “common sense” they’ve used to help it learn (“profile”) an amp.
My day job involves a lot of data-driven modeling, and I’ve been looking for an excuse to use neural networks for something “real-world”…so here goes!
The neural network
Here’s a little bit of technical information about the neural net. If you just want to see the source code, then you can find a link at the bottom of the post.
Some mathematical notation
Just to make things clear:
- The input signal is taken to be a discrete time series x=(x1, x2, x3…). The output is y=(y1, y2, y3…). Each element is a real number between -1 and 1.
- The amp is some unknown function f. I’m not sure what its inputs should be, but I’ll tell you what I try as I go. For example, if f just needs the current sample at time t, then I could write f(x(t)).
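To make that representation concrete, here’s how one might load a 16-bit mono WAV into floats in [-1, 1] using only the standard library plus NumPy. This is just a sketch of the idea (the function name and the round-trip demo are mine, not from the actual pipeline):

```python
import os
import tempfile
import wave

import numpy as np

def read_wav_as_float(path):
    """Read a 16-bit mono WAV file into a float array in [-1, 1)."""
    with wave.open(path, "rb") as w:
        raw = w.readframes(w.getnframes())
    # int16 spans [-32768, 32767]; dividing by 32768 maps it into [-1, 1)
    return np.frombuffer(raw, dtype=np.int16).astype(np.float64) / 32768.0

# Round-trip demo on a synthetic 440 Hz tone (hypothetical file path):
tone = (np.sin(2 * np.pi * 440 * np.arange(441) / 44100) * 32767).astype(np.int16)
demo_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
with wave.open(demo_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(tone.tobytes())
x = read_wav_as_float(demo_path)
```

Every sample in x then sits in the [-1, 1] range the notation above assumes.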
Autoregressive fully-connected network
This is probably the simplest way I could imagine trying to do this and a great place to start. I take as input the last n samples and try to use them to predict the output at the current time point. In other words, for some point in time i, I assume that f has the form
y(i) = f(x(i), x(i-1), …, x(i-n+1)).
I have to pick n, which tells me how far the neural network can look “back in time” to predict what should happen right now. This is important because a lot of the amp’s character comes from factors that have a slight delay (power sag & saturation, etc).
One big drawback of this approach is that picking n might be tricky: too short, and I might not have all of the info I need to predict the amp’s behavior correctly; too long, and it might be hard to sort through all of the information I have available. I need to strike a balance.
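To make the autoregressive setup concrete, here’s a small sketch (the function name is mine, not from any particular library) of slicing the input signal into overlapping length-n windows, one per predicted output sample:

```python
import numpy as np

def make_frames(x, n):
    """Row i holds (x[i-n+1], ..., x[i]): the n-sample history the
    network sees when predicting y[i]. The first valid prediction is
    at time i = n - 1, so we get len(x) - n + 1 frames in total."""
    return np.stack([x[i - n + 1 : i + 1] for i in range(n - 1, len(x))])

x = np.linspace(-1.0, 1.0, 10)  # toy stand-in for a DI signal
X = make_frames(x, n=4)         # shape: (7, 4)
```

Each row of X is one training input; the matching target is the amp’s output at that row’s final sample.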
Like I said, I’m just using a simple feed-forward neural net (FFNN). FFNNs take a vector input, transform it linearly, then run the result through some nonlinear function. In other words, layer l of our neural net takes as input h(l) and computes
h(l+1) = a(W h(l) + b).
W is just a matrix, and b is a vector. a is our nonlinear activation function. As a first try, I’ll just use a ReLU for a since they tend to work well for lots of tasks. However, since we need the output y to be between -1 and 1, I’ll use a sigmoid on the last layer (and rescale it accordingly). I’m not sure that this is strictly necessary if the model trains well, but I don’t want the output to clip.
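Here’s a plain-NumPy sketch of that forward pass. The layer sizes are placeholders and the random weights are just for illustration; the real training code may be organized differently:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """One FFNN pass: h <- a(W h + b) per layer, with ReLU inside
    and a rescaled sigmoid on the last layer so the output is in (-1, 1)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    W, b = weights[-1], biases[-1]
    return 2.0 * sigmoid(W @ h + b) - 1.0  # rescale (0, 1) -> (-1, 1)

rng = np.random.default_rng(0)
sizes = [32, 32, 32, 1]  # input history length, two hidden layers, one output
weights = [rng.standard_normal((o, i)) * 0.1 for i, o in zip(sizes, sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]
y = forward(rng.standard_normal(sizes[0]), weights, biases)
```

The rescaled sigmoid guarantees the prediction stays in (-1, 1) regardless of what the hidden layers do.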
Experiment 1: A distortion plugin
After doing some sanity checks, I grabbed the JS distortion plugin from Reaper. Here’s the DI that I used as training input:
…and here’s the [training] output from the plugin:
I then trained a neural network on this data [details below]. To check how well training was going, I did a few things:
- I computed the average error (RMSE) in the neural network’s predictions on a validation set.
- I plotted the waveform that the neural net predicts to compare visually against the true output.
- I made some audio files so that I could listen to the results.
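The first of those checks, RMSE, is just a one-liner (a minimal version, written by me rather than copied from the project code):

```python
import numpy as np

def rmse(y_pred, y_true):
    """Root-mean-square error between predicted and true waveforms."""
    diff = np.asarray(y_pred) - np.asarray(y_true)
    return float(np.sqrt(np.mean(diff ** 2)))
```

For example, predicting all zeros against a target of all ones gives an RMSE of exactly 1.0.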
After 1000 iterations on minibatches of 4096 input-output examples, the neural net’s predictions are almost visually indistinguishable from the plugin’s output…
…and things sound pretty good as well! Here’s a comparison between the plugin and the neural net. (Warning: this won’t sound very good because it’s missing a lot of the ingredients to making a good tone!)
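Since the post doesn’t show the training loop itself, here’s a toy stand-in for how minibatch training works: fitting a single tanh unit to a known static nonlinearity via SGD on the mean squared error. This illustrates the procedure only; it is not the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

target = lambda x: np.tanh(3.0 * x)  # pretend this is the "amp"
w = 0.1                              # model: y_hat = tanh(w * x)
lr = 0.2

for step in range(500):
    x = rng.uniform(-1.0, 1.0, size=64)  # one minibatch of inputs
    y = target(x)                        # matching "amped" outputs
    y_hat = np.tanh(w * x)
    # gradient of the mean squared error with respect to w
    grad = np.mean(2.0 * (y_hat - y) * (1.0 - y_hat ** 2) * x)
    w -= lr * grad

final_rmse = float(np.sqrt(np.mean((np.tanh(w * x) - y) ** 2)))
```

The real networks have many weights instead of one, but the loop is the same shape: sample a minibatch, compute the error gradient, and nudge the parameters.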
But the output of the plugin looks a little funny to me. Specifically, it looks like the output is just the input, scaled up by some factor and clipped.
I checked this by plotting the input vs the output for the training data. Each point is a sample in time.
What we see is that the plugin is actually just applying a very simple nonlinearity that depends only on the input at any given instant, not on its history. This is in contrast to the “dynamic” behavior we experience in real-world analog circuits, whose input-output behavior depends on the input over extended amounts of time (even if that’s only a fraction of a second). So this plugin isn’t very complex. We’ll try something harder next.
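The input-vs-output scatter check can be quantified: bin samples by input value and measure the spread of the output within each bin. A memoryless waveshaper (here a hypothetical tanh drive standing in for the plugin) shows almost no spread, while a system with memory (a one-pole lowpass) traces a loop and shows a lot. This is my own diagnostic sketch, not from the project code:

```python
import numpy as np

def max_bin_spread(x, y, bins=200):
    """Bin samples by input value; a truly memoryless map gives
    (nearly) zero output spread within each bin."""
    idx = np.digitize(x, np.linspace(-1.0, 1.0, bins))
    spreads = [np.ptp(y[idx == k]) for k in np.unique(idx) if np.sum(idx == k) > 1]
    return max(spreads)

sr = 44100
x = np.sin(2 * np.pi * 440 * np.arange(4410) / sr)  # 0.1 s of A440

y_static = np.tanh(10.0 * x)  # memoryless "distortion"

y_dyn = np.zeros_like(x)      # one-pole lowpass: output depends on history
for i in range(1, len(x)):
    y_dyn[i] = 0.9 * y_dyn[i - 1] + 0.1 * x[i]
```

For the static nonlinearity the spread is tiny (bounded by the curve’s slope times the bin width), while the lowpass output’s spread is large because the same input value occurs with many different histories.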
Experiment 2: Ignite Emissary
So, I went back to Reaper and pulled up the Ignite Emissary. I personally feel that this is one of the best free amp sims to come out recently, so I hope that it will have more challenging behavior to model. To keep things simple, I’ll start by trying to match the amp’s output without a cabinet impulse response. It would be interesting to find out in the future if we can capture the full end-to-end signal chain, but we’ll start here.
I needed to use a bigger neural net for this experiment to get good results (see below). After 50k minibatches, the RMSE was just above 0.01. Here’s a look at the predicted waveform at that point:
Not as perfect as before, but it looks like I’ve captured the main idea–the neural net prediction (green) follows the output from the Emissary (orange) pretty closely.
However, the real question is: does it sound the same? This is somewhat subjective, but I’d intuitively expect that a low RMSE should (roughly) correspond to a good subjective match.
Here’s a clip of the Emissary’s output:
…and the neural net:
Personally, I can’t hear the difference between the two. Given the visual mismatch above, this surprised me! It suggests that my ears stop noticing differences well before my eyes do. Driving an error metric like RMSE all the way to zero may not be necessary, but getting it small does seem sufficient in practice to ensure good results.
Bonus: A/B testing (soloed and in a mix)
For fun, I took the result from Experiment 2 and threw it into a full mix, switching back and forth between the Emissary and the neural net, just to hear things in context. That result is at the top of this post.
Here are a few details to know about this full mix:
- See below to download the IR that I used.
- I applied a pretty big EQ cut (to both the neural net output and the real amps, so don’t worry–no cheating!) around 250 and 350 Hz just because I didn’t like the tone on the amp after I put it into a mix, but I didn’t want to get new data and re-train the model.
- All of the processing (i.e. reamping with the neural network) was done off-line (outside of my DAW) via a crude python script (sorry, no plugin quite yet!).
What worked:
I could use a pretty simple neural network to perfectly learn simpler distortions. That’s a good start!
What kind of worked:
This simple neural network wasn’t quite as good at modeling the Emissary, but the subjective result was still better than I expected. While there’s clearly a lot of room for improvement, I’m happy that the result seemed to be musically useful, which is an important criterion.
What didn’t work:
Lots of things didn’t work!
- Modeling a tone with a cabinet didn’t work so well. There are good reasons why this is a little harder, and hopefully we’ll get into those in the future. In my experience, getting the cab right matters more than anything else to a guitar amp’s overall tone, so I might write another post focusing on this failure.
- The code I wrote is too slow to actually use in real time: it takes more than 1 second to process 1 second of audio–we can’t keep up! Furthermore, we don’t only need throughput; we need low latency. Evidence suggests that we need at most about 5 ms of latency for a plugin to be usable in real time. I’ve got some good insights about which parts of the computation are currently bottlenecks, and I’m optimistic that they can be addressed. Again, maybe this deserves its own post. However, it’s important to note that there’s no reason why an autoregressive model is inherently too slow to work–as soon as we have a new sample, we can immediately run it (along with the history we already have) through the network to predict the corresponding output.
There are a lot of things to work on. Here are a few that come to mind:
- I haven’t explored the generalization of the model too much.
- I also haven’t considered how the training data that I provide impacts generalization, but this is probably very important!
- I suspect that this will work just as well for modeling pedals.
- Amps have knobs! It would be great if we could work those into the model.
- Fully-connected neural networks are great…but there’s a lot more we could try. We’ll try models with convolutional layers, recurrent structure, etc.
- Wisdom from modeling speech audio suggests that it might be easier to work with a frequency-domain representation of the data (e.g. a spectrogram). This might be interesting to experiment with as well.
- Everything so far has used digital amp simulators as the target for the neural net. Eventually, we’ll want to get to modeling a real tube amp. I’m sure many people won’t be fully convinced until I do that…but I don’t currently own any tube amps. (Let me know if you want to help!)
Here are more details for curious folks.
Data & code
Like I said, a lot of this is about sharing what I learn. If you want to see the source code or data, here you go! neural-amp-modeler on GitHub
Also, here’s a link to the impulse response that I used for the bonus.
How much history do you need?
As I mentioned above, one choice we have to make with this kind of model is how much input history to take into account when predicting the next output sample. If it’s too short, then we won’t be able to capture how the amp responds to its input. If it’s too long, then we waste time on extra computations, and we might make it harder for the neural net to figure out the right behavior. So, using the right input length is important to getting a good answer in a reasonable amount of time.
How long does it take for the guitar amp to “adjust” to its input? To try to answer this question, I made a test signal that’s silent for 1 second, then plays an A 440 sine wave for 1 second, then is silent again. I ran it through the Emissary without the IR, as above. What I’m looking for is (roughly) how long it takes for the amp to reach steady state.
Here’s the output as the signal starts:
…and as it ends:
Based on this, I’d eyeball that about 8192 samples should be enough to model most of the dynamic response of the amp correctly, and 1024 might be enough to get the rough idea. For a sample rate of 44.1 kHz, 8192 samples corresponds to a little under 200 milliseconds (and 1024 samples to about 23 ms). I’d be interested to hear whether that is in the same ballpark as any characteristic time constants in an analog amp. I ended up using 4096 samples as a practical compromise, but haven’t investigated this closely.
(Note: actually, the output comes to rest at exactly zero after about 17k samples after the input stops, but I’m going to assume that I don’t need all of that to get close enough. We’ll keep this in mind and come back to it later if it seems to actually be important to get a good sound.)
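The test signal itself is simple to generate (a sketch, assuming the 44.1 kHz sample rate used throughout the post):

```python
import numpy as np

SR = 44100
t = np.arange(SR) / SR                # one second of time stamps
tone = np.sin(2 * np.pi * 440.0 * t)  # A440 sine, 1 second
silence = np.zeros(SR)                # 1 second of silence
test_signal = np.concatenate([silence, tone, silence])
```

Feeding this through the amp sim and inspecting the output just after sample 2 * SR (where the tone stops) is what the settling-time estimate above is based on.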
More on the architectures
The neural network I used for experiments 1 and 2 (call it “net 1”) used an input length of 32 samples and had 2 hidden layers, each with 32 hidden units. I used ReLU activations for the first two layers and a rescaled sigmoid for the last layer to ensure that the outputs are between -1 and 1. I didn’t try any other activations; maybe something else would work better.
On experiment 2, net 1 didn’t do very well. Here’s a shot of the predicted waveform after 40k minibatches:
Sure enough, it also doesn’t sound very similar.
I suspect the main reason for its poor performance is that, as we saw above, 32 samples just isn’t enough history to get things right. So, I made a bigger net (“net 2”) with an input length of 4096 samples and 4 hidden layers with (1024, 512, 256, 256) hidden units to produce the results above. I suspect there’s an architecture and set of hyperparameters out there that could get me better results (and get them more quickly), so we’ll get into that more seriously in the future.
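For a sense of scale, we can count the weights and biases of the two nets (taking the layer lists above literally; this bookkeeping function is mine):

```python
def n_params(sizes):
    """Total weights + biases for a fully-connected net whose layer
    widths (input, hidden..., output) are given by `sizes`."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

net1 = [32, 32, 32, 1]                 # 32-sample input, two hidden layers
net2 = [4096, 1024, 512, 256, 256, 1]  # 4096-sample input, four hidden layers
```

Net 1 comes to about 2.1k parameters, while net 2 is roughly 4.9 million, which goes a long way toward explaining both the improved fit and the real-time performance trouble.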