Benne Holwerda's Research Blog: More Machine Learning with WALLABY

Title: WALLABY Pilot Survey: kNN identification of perturbed galaxies through HI morphometrics

This is my paper on the level of perturbed-looking galaxies in a couple of WALLABY (Widefield ASKAP L-band Legacy All-sky Blind surveY) fields. I wanted to see how well machine learning worked on HI data, using the morphometrics that I have come to rely on to parameterize HI morphology.

What made it possible at all was a paper that classified sources in two of those fields into different levels and types of perturbed (Lin+ 2023). Looking at the plot below, that seems like a reasonably sized sample to try some simpler machine learning out on, building on the work I had done in 2011 on a variety of HI surveys and in 2023 on the WALLABY pilot data.

I simplified the labels to just perturbed-looking and “not”, since this is not the biggest training set used in ML.

But these galaxies were closer than those in 2023 and there was some reason to think it might work a little better.

Let’s find out!

There is a pretty good spread in values and maybe some separation in this parameter space. Excellent material to try some (simple) classifiers out on. The one I picked was kNN since it is conceptually easy: classify each object the same way as its few nearest neighbors in the n-dimensional space. The space does not need to be orthogonal, which the morphometric space definitely isn’t. And there is only one thing to tune: the number of neighbors.
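A minimal sketch of the kNN idea, using only the Python standard library. The feature values and labels below are made up for illustration and are not WALLABY morphometrics; in practice one would more likely reach for a library implementation such as scikit-learn's `KNeighborsClassifier`.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point, nearest first
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    # Vote among the k closest. On a tied vote, most_common returns the label
    # seen first, which here is the label of the nearer neighbor.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy, made-up values (two features per galaxy), NOT real measurements
train_X = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
           (0.80, 0.90), (0.85, 0.80), (0.90, 0.95)]
train_y = [0, 0, 0, 1, 1, 1]  # 0 = not perturbed, 1 = perturbed

print(knn_predict(train_X, train_y, (0.82, 0.88), k=3))
```

The only free choice is `k`, which is why tuning kNN comes down to picking the number of neighbors.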

Checking for the optimal number of neighbors. I did the feature engineering (picking which parameters to feed it) already. This was partially motivated by my experiences with some of them (Smoothness is not good, Intensity also needs a smoothing kernel).

As we can see from the plot above, the optimal number of neighbors is 2, after which all the metrics diverge and degrade. Okay, 2 neighbors (a bit low maybe?) it is!
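A scan over the number of neighbors can be sketched as follows. This is an illustration on synthetic data (two overlapping Gaussian blobs), not the labeled sample from the paper; the helper names, the 30% test fraction, and the 20 repeats are all assumptions made for the example. Note that with an even k such as 2, a binary vote can tie, so the tie-break rule (here: the nearer neighbor wins) becomes part of the classifier.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def mean_accuracy(data, k, n_repeats=20, test_fraction=0.3, seed=0):
    """Average test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    n_test = max(1, int(len(data) * test_fraction))
    scores = []
    for _ in range(n_repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        hits = sum(knn_predict(train, x, k) == y for x, y in test)
        scores.append(hits / len(test))
    return sum(scores) / len(scores)

# Synthetic stand-in for a small labeled sample: ((feature1, feature2), label)
rng = random.Random(42)
data = [((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)), 0) for _ in range(60)]
data += [((rng.gauss(1.5, 1.0), rng.gauss(1.5, 1.0)), 1) for _ in range(30)]

# Scan k and report the mean accuracy for each value
scores = {k: mean_accuracy(data, k) for k in range(1, 10)}
for k, score in scores.items():
    print(f"k={k}: mean accuracy {score:.2f}")
best_k = max(scores, key=scores.get)
print("best k on this synthetic sample:", best_k)
```

The best k on synthetic blobs says nothing about the best k for the real morphometrics; the point is only the shape of the loop.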

Because the training set is still pretty small, I did the train-test loop a couple of times to get an idea of the average performance. We get the confusion matrix below:

The average performance of the test sample using a series of train/test of the kNN.

Not terrible. Not amazing either.
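For reference, an averaged confusion matrix over several train/test runs can be assembled along these lines. The label lists are made up to show the bookkeeping and are not results from the paper; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 = perturbed."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def average_counts(runs):
    """Average (tn, fp, fn, tp) element-wise over several train/test runs."""
    n = len(runs)
    return tuple(sum(run[i] for run in runs) / n for i in range(4))

# Made-up test-set truths and predictions from three hypothetical runs
runs = [
    confusion_counts([0, 0, 0, 1, 1], [0, 0, 1, 1, 0]),
    confusion_counts([0, 0, 1, 1, 1], [0, 0, 1, 0, 1]),
    confusion_counts([0, 0, 0, 0, 1], [0, 1, 0, 0, 1]),
]
tn, fp, fn, tp = average_counts(runs)

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)  # of those predicted perturbed, how many are
recall = tp / (tp + fn)     # of those truly perturbed, how many are found
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Averaging the counts before computing the metrics keeps a single unlucky split from dominating, which matters when each test set holds only a handful of galaxies.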

If we use the full training sample to train (and no separate testing) we get:

What do I get if I use the whole labeled data set for training? You have to worry about over-fitting, but on the other hand this is still a small training sample.

A little better. We really could do with a bigger training sample, but that is a refrain in ML. Okay, which ones are predicted to be perturbed? It’s these. If you compare with the ones above, it’s a fair first cut.

kNN predictions for which galaxies are perturbed. In individual cases it works…sort of. As a fraction, it works very well.

And we had a second field where there were more galaxies (it’s wider) so we could apply our kNN classifier there too:

kNN-predicted perturbed galaxies in the NGC 5044 field.

Now we have a prediction for when this field is studied in more detail for signs of perturbed galaxies.

What we have found is that the kNN classifier is pretty decent at getting the fraction of galaxies that is perturbed in a given field. For individual galaxies, however, it is best to think of this as a prediction: with an accuracy of 80%, it still gets it wrong 1 in 5 times. The good news is that this is pretty easy to beat, with better training sets and perhaps a direct classification from HI maps to a perturbed label, rather than going through morphometrics first. One could perhaps use the first-moment map (the velocity map) instead of the column density one to train a convolutional neural network. But for now we have a prediction for the NGC 5044 galaxies and some more intuition for how to apply machine learning to HI map morphology.
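A small worked example of why a classifier can do well on the perturbed fraction while still missing on individual galaxies. The counts are illustrative, chosen only to give 80% accuracy, and are not taken from the paper. When false positives and false negatives are similar in number, they cancel in the fraction but not in the per-galaxy error rate. Whether the errors balance like this in a real field is an assumption the sketch does not test.

```python
def field_summary(tn, fp, fn, tp):
    """Compare per-galaxy error rate with true and predicted perturbed fractions."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "error_rate": (fp + fn) / total,           # wrong calls on individual galaxies
        "true_fraction": (tp + fn) / total,        # galaxies that really are perturbed
        "predicted_fraction": (tp + fp) / total,   # galaxies flagged as perturbed
    }

# Illustrative counts for 100 galaxies (made up, NOT from the paper)
summary = field_summary(tn=60, fp=10, fn=10, tp=20)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here one in five individual labels is wrong, yet the predicted and true perturbed fractions agree because the two kinds of error are equal in number.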