Repo: https://github.com/DeanIA/yolo_sam
For my final project I searched for a machine learning workflow that could extract meaningful information from satellite images. I explored a number of options before settling on a YOLO/SAM combination.
Here is a quick summary of my research, plus some background on this weird interest of mine, which isn’t interactive (yet) and which some might even dare argue isn’t a telecommunication.
I definitely took the scenic route on this one: I looked at a number of different models and workflows, studied different machine learning tasks (particularly in computer vision), and learned how they are applied to geospatial data like satellite imagery.
Different satellites take different kinds of pictures. These pictures can contain more than your run-of-the-mill RGB bands, in which case they are called multispectral. Intuitively, I assumed that processing multispectral images, compared to RGB, would boost accuracy on classification problems.
To test this, I compared the performance of a VGG16 CNN and a ConvMixer hybrid CNN/transformer on a classification task using the EuroSAT dataset, which is available in both RGB and multispectral versions. The VGG16 trained on the RGB version of EuroSAT, and the ConvMixer trained on the multispectral version. After running a transfer-learning training loop for both, I proved my intuition wrong: the model using the RGB dataset blew the multispectral one out of the water, 94% vs. 84% accuracy.

VGG-16

ConvMixer
Next I started looking at ways to segment satellite imagery, and found a promising technique developed by academic researchers. The problem they were trying to solve: semantic segmentation models can classify pixels, but they aren’t very good at tracing accurate shapes, while models that perform non-semantic (class-agnostic) segmentation produce sharp masks but can’t classify them.
Their solution was to pass the same dataset through both a semantic and a non-semantic segmentation model. Each model produces segmentation masks, which are compared to find overlap, and the most accurate way to merge the two is computed, yielding sharper, categorized masks. This taught me a lot, but ultimately it didn’t work well enough, improving the accuracy of my agriculture masks only from 65% to 75%. It did, however, inspire the workflow that finally worked.
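The core merging idea can be illustrated with a toy example: each sharp, class-agnostic instance mask inherits the dominant class of the coarser semantic map under it. The arrays and labels here are invented for illustration; the researchers' actual merging rule is more involved than a simple majority vote.

```python
import numpy as np

def label_instances(semantic_map, instance_masks):
    """Assign each binary instance mask the dominant class of the
    semantic map inside that mask (class 0 = background)."""
    labeled = []
    for mask in instance_masks:
        classes = semantic_map[mask]  # semantic labels under the mask
        votes = np.bincount(classes)
        votes[0] = 0                  # ignore background votes
        labeled.append((mask, int(votes.argmax())))
    return labeled

# Tiny 4x4 semantic map: class 1 (say, agriculture) top-left,
# class 2 bottom-right
semantic = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 2, 2],
                     [0, 0, 2, 2]])

# A sharp mask from the non-semantic model covering the top-left patch
inst = np.zeros((4, 4), dtype=bool)
inst[:2, :2] = True

merged = label_instances(semantic, [inst])
print(merged[0][1])  # 1 -> the sharp mask inherits the agriculture class
```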


I also looked at a new open-source GeoAI library based on Fast R-CNN. It was easy to set up and well documented. It looks like it’s going to be great, but it wasn’t powerful enough for my task.


Dr. Qiusheng Wu ❤️🔥🐐
Instead of semantic segmentation, I trained a YOLO model that predicts classified bounding boxes on a dataset of satellite images. The results were remarkably accurate. The bounding boxes are then exported to a txt file, processed, and fed into a SAM2 model as prompts, which produces an accurate mask of the object in each box. Chaining the two allows both counting objects (like buildings) with the YOLO model and measuring area (like the growth of agricultural land) with the SAM2 masks.


YOLO to SAM
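The glue step between the two models can be sketched like this: parse YOLO's exported label lines (format: `class x_center y_center width height`, all normalized to [0, 1]) into absolute xyxy boxes suitable as SAM2 box prompts, then count boxes and sum mask pixels for area. The helper names, image size, and pixel scale are assumptions for illustration, and a dummy array stands in for SAM2's real mask output.

```python
import numpy as np

def yolo_txt_to_xyxy(lines, img_w, img_h):
    """Parse YOLO-format label lines into (class_id, [x1, y1, x2, y2])
    boxes in absolute pixel coordinates."""
    boxes = []
    for line in lines:
        cls, xc, yc, w, h = (float(v) for v in line.split())
        x1 = (xc - w / 2) * img_w
        y1 = (yc - h / 2) * img_h
        x2 = (xc + w / 2) * img_w
        y2 = (yc + h / 2) * img_h
        boxes.append((int(cls), [x1, y1, x2, y2]))
    return boxes

def mask_area(mask, m_per_pixel=1.0):
    """Area of a binary SAM mask, optionally scaled to square metres."""
    return mask.sum() * m_per_pixel ** 2

# Example: one detection centred in a 100x100 image
lines = ["0 0.5 0.5 0.2 0.2"]
boxes = yolo_txt_to_xyxy(lines, 100, 100)
print(len(boxes))   # 1 -> counting objects is just counting boxes
print(boxes[0][1])  # [40.0, 40.0, 60.0, 60.0]

# A dummy 20x20 binary mask standing in for SAM2's output for that box
mask = np.ones((20, 20), dtype=bool)
print(mask_area(mask, m_per_pixel=10.0))  # 40000.0
```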