Wednesday, February 5, 2025

AI as a citation equalizer?

At the American Astronomical Society meeting in Washington DC last month (subjectively, January was loooooong this year), I sat in on a fascinating talk given by Kartheik Iyer. He and his collaborators developed a clustering map of the astrophysics literature: quite literally the current landscape of the field.

The landscape of astronomical literature. You can go stand on a mountain of previous work or wade into the shallows of interdisciplinary topics.

This maps out where lots (enough?) has been said already and where more could be done.

To show this as a slightly more scientific heat map:

The distribution of topics in astronomy as mapped onto a density map. It is in effect a giant clustering algorithm run on vectorized papers.


What made me perk up about this was the option to ask a question or put in a phrase and get a list of papers on that subject. This is partly LLM based and can therefore interpret plain language.

This hopefully circumvents the problem that certain concepts went by different nomenclature. For example, “dark matter” (Vera Rubin’s and earlier Zwicky’s phrase) was talked about as a “non-stellar mass-to-light ratio” by the radio astronomers: in effect the same thing. A sufficiently trained LLM can perhaps bridge that gap.
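For the curious, here is a minimal sketch of the underlying idea, not Pathfinder’s actual implementation: embed phrases as vectors and compare them by cosine similarity, so synonymous jargon lands close together. The model choice and phrases below are purely illustrative.

from sentence_transformers import SentenceTransformer, util

# Small general-purpose embedding model; a tool like Pathfinder would
# presumably use something tuned on astronomy text. Illustration only.
model = SentenceTransformer("all-MiniLM-L6-v2")

phrases = [
    "dark matter",
    "non-stellar mass-to-light ratio",
    "star formation history",
]
embeddings = model.encode(phrases)

# Cosine similarity of the first phrase to the other two; the
# synonymous phrasing should score higher than the unrelated one.
print(util.cos_sim(embeddings[0], embeddings[1:]))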

You can try this here:

https://huggingface.co/spaces/kiyer/pathfinder

On the one hand this makes me very excited. Writing a phrase in my paper followed by a bunch of citations always elicited the worry that I was unfairly overlooking someone’s work, simply because my memory of who did what is fallible at best and completely out to lunch at worst. I’m not alone in this; part of a referee’s job is to suggest more references that are relevant to the subject at hand. Better, but still very fallible. I have tried some other tools (I use both Google Scholar and ADS, for example, and connectedpapers was another).

And it has been shown that, for example, women are cited less in science (Nature: https://www.nature.com/articles/s41586-022-04966-w). It may be that nothing more nefarious than poor human memory is to blame. I can also imagine other reasons.

So I was excited at the possibility of at least improving this issue of fair citations. One can put in the phrase that precedes the citations and see if more references pop up, maybe to jog your memory.

This would be virtuous (give proper credit) but also make it easy to do (easy virtue….wait…).

While playing around with it, I’ve found the list often includes false positives. Its ranking of most relevant works certainly did not align with my personal one (which is admittedly very subjective: which paper convinced you has a lot to do with which one you encountered first).

There is also no way to give feedback. And as with many AI/ML applications, it all depends on the ingested training sample. A lot of science was done before 1995, for example, but may not be part of the ADS database. Has every journal been included? Some poorer academics only publish their results/catalogs on astro-ph because that is all they can afford (preprints are not included right now), and what about results that only ever appeared in a textbook, etc.

But this shiny new AI tool may help me improve my citation practice. It might help yours as well.



Tuesday, January 28, 2025

More Machine Learning with WALLABY

Title: WALLABY Pilot Survey: kNN identification of perturbed galaxies through HI morphometrics

Link

My paper on the level of perturbed-looking galaxies in a couple of WALLABY (Widefield ASKAP L-band Legacy All-sky Blind surveY) fields is out. The reason for it was that I wanted to see how well this worked on HI data, using the morphometrics that I have come to rely on to parameterize HI morphology.

What made it possible at all was a paper that classified sources in two of those fields into different levels and types of perturbation (Lin+ 2023). Looking at the plot below, that seems like a reasonably sized sample to try some simpler machine learning on, building on the work I had done in 2011 on a variety of HI surveys and in 2023 on the WALLABY pilot data.

I simplified the labels to simply “perturbed-looking” and “not”, since this is not the biggest training set ever used in ML.
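Collapsing the labels is a one-liner in code. A minimal sketch, assuming a pandas table; the column names and class scheme here are made up for illustration, not the actual Lin+ 2023 catalog schema:

import pandas as pd

# Hypothetical catalog with multi-level perturbation classes
# (0 = settled, higher = more perturbed). Not the real schema.
catalog = pd.DataFrame({
    "name": ["J1234-01", "J1235-02", "J1236-03"],
    "perturbation_class": [0, 2, 3],
})

# Collapse to binary: anything above "settled" is perturbed-looking.
catalog["perturbed"] = (catalog["perturbation_class"] > 0).astype(int)
print(catalog)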

But these galaxies were closer than those in the 2023 work, so there was some reason to think it might work a little better.

Let’s find out!


The morphometric feature space for the training sample. The level of perturbedness is indicated by the color.

There is a pretty good spread in values and maybe some separation in this parameter space: excellent material to try some (simple) classifiers on. The one I picked was kNN, since it is conceptually easy: classify each point according to its few nearest neighbors in the n-dimensional space. The space does not need to be orthogonal, which the morphometric space definitely isn’t. And there is only one thing to tune: the number of neighbors.
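As a sketch of what that looks like in practice (using scikit-learn and random stand-in data; the real features would be the morphometric parameters measured from the HI maps):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: rows are galaxies, columns are morphometric parameters
# (think Concentration, Asymmetry, Gini, M20). Random numbers here just
# make the sketch runnable.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = (X[:, 1] + 0.5 * rng.normal(size=100) > 0).astype(int)  # fake perturbed label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scaling first matters for kNN: distances should not be dominated by
# whichever parameter happens to have the largest numeric range.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))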


Checking for the optimal number of neighbors. I did the feature engineering (picking which parameters to feed it) already. This was partially motivated by my experiences with some of them (Smoothness is not good; Intensity also needs a smoothing kernel).

As we can see from the plot above, the optimal number of neighbors is 2, after which all the metrics diverge and degrade. Okay, 2 neighbors (a bit low maybe?) it is!
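The scan itself is a short loop. A sketch (with X and y as in the snippet above; the cross-validation setup and metric are my assumptions, not necessarily what was done in the paper):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Cross-validated accuracy as a function of the number of neighbors.
for k in range(1, 11):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")
    print(k, round(scores.mean(), 3))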

Because the training set is still pretty small, I ran the train-test loop a couple of times to get an idea of the average performance. We get the confusion matrix below:


The average performance on the test sample over a series of train/test runs of the kNN.

Not terrible. Not amazing either.
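For reference, the averaging is straightforward to sketch (again with the stand-in X and y from above; the number of repeats and the split fraction are my choices, not necessarily the paper’s):

import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Repeat the train/test split with different random seeds and average
# the resulting confusion matrices to smooth out small-sample noise.
matrices = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
    knn.fit(X_tr, y_tr)
    matrices.append(confusion_matrix(y_te, knn.predict(X_te)))
print(np.mean(matrices, axis=0))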

If we use the full training sample to train (and no separate testing) we get:


What do I get if I use the whole labeled data set for training? You have to worry about over-fitting, but OTOH this is still a small training sample.

A little better. We could really do with a bigger training sample, but that is a common refrain in ML. Okay, which ones are predicted to be perturbed? It’s these. If you compare with the ones above, it’s a fair first cut.


kNN predictions for which galaxies are perturbed. In individual cases it works…sort of. As a fraction, it works very well.

And we had a second field with more galaxies (it’s wider), so we could apply our kNN classifier there too:


Predicted kNN perturbed galaxies in the NGC 5044 field.

Now we have a prediction for when this field is studied in more detail for signs of perturbed galaxies.
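The application step is equally short to sketch. A minimal version with stand-in arrays; X_field2 is a hypothetical placeholder for the morphometrics measured in the second field:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))        # labeled morphometrics (stand-in)
y = (X[:, 1] > 0).astype(int)        # stand-in perturbed labels
X_field2 = rng.normal(size=(50, 4))  # unlabeled second field (stand-in)

# Train on everything labeled, then predict on the new field and quote
# the predicted perturbed fraction.
knn_full = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
knn_full.fit(X, y)
print("predicted perturbed fraction:", knn_full.predict(X_field2).mean())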

What we have found is that the kNN classifier is pretty decent at getting the fraction of galaxies that is perturbed in a given field. For individual galaxies, however, it is best to think of this as a prediction: with an accuracy of 80%, it still gets it wrong 1 in 5 times. The good news is that this is pretty easy to beat, with both better training sets and perhaps a direct classification from HI maps to a perturbed label, rather than going through morphometrics first. One could perhaps use the first moment map (the velocity map) instead of the column density one to train a convolutional neural network. But for now we have a prediction for the NGC 5044 galaxies and some more intuition on how to apply machine learning to HI map morphology.
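To make that CNN idea concrete, here is a minimal sketch in PyTorch; the architecture and the 64x64 single-channel input are assumptions for illustration, not a tested design:

import torch
import torch.nn as nn

# Tiny CNN mapping an HI map (column density or velocity field) directly
# to a perturbed / not-perturbed classification.
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 2),  # two classes: perturbed or not
)

maps = torch.randn(8, 1, 64, 64)  # stand-in batch of eight 64x64 HI maps
logits = model(maps)
print(logits.shape)  # torch.Size([8, 2])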


Monday, January 6, 2025

Two Books on a chaotic Cosmos

I recently finished “Our Accidental Universe” by Prof. Chris Lintott and just before that I re-read “The Disordered Cosmos” by Prof. Chanda Prescod-Weinstein. Both are books for a general audience that explain the nature of our Universe, told by some of the best explainers in the business. They are as far apart in style as possible, as far as I can tell. I like reading books this way, in contrasting pairs. Both are a series of essays/chapters on topics in Astronomy, highlighting the frequent randomness of our Universe and the serendipity in our discoveries.


The Disordered Cosmos (DC) originated from blog posts, and this shows somewhat in the writing and language. There are footnotes explaining terms or author’s asides, but no extended reference list.

Our Accidental Universe (AU) is in that sense a much more “traditional” popular Astronomy book: footnotes for jokes and asides, with a long list of where the author got his information for these stories.

The bigger difference between these two books is how much you meet the author personally. In AU, you briefly meet Chris as his younger self to highlight a discovery or to set a timeframe. There is a person behind the stories and jokey asides, but the author keeps his privacy. This is very different in the DC. Here we meet the author personally, writing about intensely personal things and the various identities she brings to the science of Astronomy/cosmology. It is a much more emotional read. When I read it for the first time in the spring of 2019, a lot of the frustrations with the system of physics resonated with me. My experiences as an immigrant white dude here are nowhere near as bad as some of those described in the DC. But the feeling of being judged on emotional labor for students, or of not belonging thanks to physics group dynamics? Yeah, that hit pretty solidly mid-tenure. I thought that spring semester in 2019 was the roughest (“oh my sweet summer child” is the phrase, I think).

So I will be honest: the second half of the DC did not stick in my brain at all. Too stressed. So I really wanted to reread it. But when I am overwhelmed by a three-class semester, where one is a new class, plus assorted chaos, yeah, I need to read about space kablooie, not the Disordered Cosmos.

The AU is a very smooth read. I also know how Chris sounds, so I heard it in his voice, and frankly he just sounds so much like the BBC. Stephen Fry encountered this when people in prison were confident he would go to university, purely because of his diction. I’ve heard a British friend describe it as “plummy”. It is a much more relaxing read, partly because I had heard some of the stories before.

So when we read books about Astronomy or our Universe, you might expect a soothing story where the narrator has a nice accent in your head. But if you want a glimpse of how this sometimes goes down in the heads of the people doing the work, the DC is a much more direct and honest look into the very human and flawed endeavor that is the science of Astronomy.

The fun part is that both books have a similar takeaway message: understanding the cosmos is fun. Be it as we are given a tour by a BBC voice, or shown it as an act of resistance against the worst human behaviours. Exploring the Universe is chaotic, random and above all fun.