No, this is not a story from a neighborhood chat group. I had a machine learning question specific to my science interests. In 2024, I classified galaxies in a JWST field using k-nearest neighbors (see the paper here). This is a very simple algorithm: you take a feature space with a training sample spread throughout it, and every time you get a new data point, you classify it as the (weighted) average of the k nearest neighbors to that point. The only variables are the number of neighbors (k) and whether or not you want to weight them by distance. It works well because it has high resolution where the training set carries a lot of information and averages things out in sparser areas.
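If you want to see what that looks like in code, a minimal sketch with scikit-learn is below; the colors, labels, and value of k are placeholders for illustration, not the actual training set from the 2024 paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Illustrative training set: two colors per object (features) and spectral
# subtypes (labels). These numbers are placeholders, not the real catalog.
X_train = np.array([[0.8, 1.2], [0.9, 1.1], [1.5, 0.7], [2.3, 0.4], [2.5, 0.5]])
y_train = np.array(["M5", "M6", "L2", "T2", "T3"])

# k nearest neighbors, weighted by inverse distance
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(X_train, y_train)

# A new object is classified from the types of its k nearest training points
print(knn.predict([[1.0, 1.0]]))
```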
It worked pretty well. I thought it did the trick in JWST. However, I also want to do this in HST parallel fields, and there the color information is not as rich as it is in JWST fields: we get four filters and that is it. Back in 2014 I generated an M-dwarf catalog, and in 2016 two students from Leiden modeled the Milky Way from this catalog (2014 paper and 2016 paper). I now want to do this for many more fields available in the Hubble archive.
Enter Yggdrasil, the mother tree (of decisions?)
I read about it on Medium here. This seems to be an upgrade to the standard decision tree algorithm: gradient-boosted trees. Here the main variable is how many branches you allow for each decision tree.
So I tried it “out of the box” on the same training set to see if we can do better. Here is where we start with the kNN classification. I have a set of colors for objects and I want to know whether these are M-, L-, or T-type dwarfs and what their subtype is. This is a multi-class problem, which a lot of ML algorithms cannot handle, but kNN and Yggdrasil GBT can.

kNN is definitely not doing badly! But there are a persistent few misclassifications. How does Yggdrasil do?
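For context, “out of the box” with the Yggdrasil Decision Forests Python package (ydf) looks roughly like the sketch below. The toy table stands in for my real color catalog; the column names and synthetic numbers are placeholders.

```python
import numpy as np
import pandas as pd
import ydf  # Yggdrasil Decision Forests (pip install ydf)

# Toy stand-in for the real training catalog: one row per object, colors as
# features, spectral type as the label. Columns and distributions are made up.
rng = np.random.default_rng(0)
sptype = rng.choice(["M", "L", "T"], size=300)
offset = np.array([{"M": 0.0, "L": 1.0, "T": 2.0}[s] for s in sptype])
train = pd.DataFrame({
    "jwst_color": offset + rng.normal(0.0, 0.2, size=300),
    "hst_color_1": 0.5 * offset + rng.normal(0.0, 0.2, size=300),
    "hst_color_2": 1.5 - 0.4 * offset + rng.normal(0.0, 0.2, size=300),
    "sptype": sptype,
})

# Default ("out of the box") gradient-boosted trees; a multi-class string
# label is handled natively, which is the part that matters here.
model = ydf.GradientBoostedTreesLearner(label="sptype").train(train)

# Classify new photometry: same feature columns, no label column required.
new_objects = pd.DataFrame({
    "jwst_color": [0.1, 1.9],
    "hst_color_1": [0.05, 0.95],
    "hst_color_2": [1.45, 0.7],
})
print(model.predict(new_objects))  # per-class probabilities, one row per object
```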

The difference is not great and this is reflected in the respective precision, recall, F1-score and accuracy.
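These are the standard per-class metrics; a minimal way to produce them with scikit-learn (with placeholder labels and predictions) looks like this:

```python
from sklearn.metrics import accuracy_score, classification_report

# Placeholder test labels and predictions; swap in the real held-out spectral
# types and either classifier's output.
y_true = ["M5", "M6", "L2", "T2", "T3", "M5"]
y_pred = ["M5", "M6", "L2", "T3", "T3", "M6"]

print(classification_report(y_true, y_pred, zero_division=0))  # precision, recall, F1 per class
print("accuracy:", accuracy_score(y_true, y_pred))
```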
But call it a hunch: I suspected Yggdrasil could perform a little better than kNN, or at least I wanted to see if it could. This was with all the information we have on deep fields from both JWST and Hubble.
How well can we separate out by subtype (Delta T)? Here are the metrics as a function of the type resolution we are attempting, for both kNN and Yggdrasil. Very comparable.

The only thing that really stood out was that Yggdrasil starts out a little better at Delta T = 0.2 and 0.3. I used 0.4 in 2024 for kNN since that was where all the metrics reached 80%. It looks like Yggdrasil can do better type resolution!
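For anyone curious how a subtype-resolution metric like this can be computed, here is a minimal sketch. It assumes types are encoded as numbers (e.g. M0 = 0.0, L0 = 1.0, T0 = 2.0, so one subtype is 0.1) and counts a prediction as a hit when it lands within Delta T of the truth. That encoding and scoring rule are my illustration, not necessarily the exact calculation behind the plots.

```python
import numpy as np

def fraction_within(y_true, y_pred, delta_t):
    """Fraction of objects whose predicted numeric type is within delta_t of the truth."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) <= delta_t))

# Placeholder numeric types (assumed encoding: M0 = 0.0, L0 = 1.0, T0 = 2.0);
# swap in the real test labels and either classifier's predictions.
y_true = [0.5, 0.6, 1.2, 2.2, 2.3]
y_pred = [0.5, 0.8, 1.2, 2.4, 2.3]

for dt in (0.2, 0.3, 0.4):
    print(f"Delta T = {dt}: {fraction_within(y_true, y_pred, dt):.2f}")
```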
This is the color-space if we only have Hubble:

So let’s try this with just the Hubble set of filters:
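Continuing the toy sketch from earlier, going Hubble-only just means dropping the JWST-based color columns before training; the placeholder column names below are the ones from that sketch.

```python
# Keep only the HST-based colors (plus the label) and retrain the same learner.
# "hst_color_1" / "hst_color_2" are the placeholder columns from the sketch above;
# in practice you would evaluate on a held-out test set, not the training table.
hst_only = train[["hst_color_1", "hst_color_2", "sptype"]]
model_hst = ydf.GradientBoostedTreesLearner(label="sptype").train(hst_only)
print(model_hst.evaluate(hst_only).accuracy)
```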

Yggdrasil performs remarkably like it did with JWST+HST (I got a little suspicious there for a second).
So if you’re OK with metrics at ~70%, Yggdrasil can get to within 2 subtypes with just the four Hubble filters. For 4 subtypes, it performs as well as kNN, at ~80%.
Reflecting on this, there is a difference between better (which always feels like it is just around the corner) and good enough. That 10% improvement over kNN at Delta T = 0.2 means I can slice the population pretty finely into different brown dwarf types, and the ~70% accuracy means I have enough of them identified to model the Milky Way distribution. That will be next.