active learning loop workflow on custom dataset #1087


Open · wants to merge 1 commit into main

Conversation

naxatra2 (Contributor) commented Jul 3, 2025

I have created a new notebook that is not linked to DeepForest in any way; I just wrote my code in the same directory for convenience. The notebook checks whether my model is learning the way I want it to.

I used a custom dataset of daisy flowers: 131 annotated images in COCO format. To reproduce this code you need a training image dataset with annotations in either .json or .csv format, plus a test dataset.

I used a very light model for training to reduce runtime. Replacing it with a stronger model should improve accuracy.

Objective

Simulate how an object detector's accuracy (mAP) improves as more images are iteratively labeled. This example currently uses random sampling from the unlabeled pool, which I will take as the baseline; the next step is to use an active-learning-specific sampling technique.

How to reproduce this (without shipping the giant flower dataset)

  1. Prepare your own dataset in COCO format (or convert from Pascal VOC or CSV to COCO).

    • You need a JSON with images, annotations, and categories keys. Each annotation must have image_id, a bbox in [x, y, w, h] format, and category_id.
    • Put your image files in two folders (one for training + pool, one for the fixed test/val split).
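For concreteness, a minimal COCO annotation file with exactly those three keys might look like this (the file name, ids, image sizes, and the "daisy" category are illustrative placeholders, not taken from the actual dataset):

```python
import json

# Minimal COCO-format annotation file with the three required keys.
coco = {
    "images": [
        {"id": 1, "file_name": "daisy_001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        # bbox is [x, y, w, h] in pixels; each box references an image_id.
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [120, 80, 60, 55], "area": 60 * 55, "iscrowd": 0},
    ],
    "categories": [
        {"id": 1, "name": "daisy"},
    ],
}

with open("annotations.json", "w") as f:
    json.dump(coco, f, indent=2)
```
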

The workflow has three main steps

  1. COCO ↔ CSV conversion

    • The parse_coco(json_file, img_dir) function reads a COCO-style annotation JSON and writes out a flat labels_raw.csv, with one row per bounding box (xmin, ymin, xmax, ymax, label, image_path).
    • The build_coco_gt(df, out_json) utility converts that CSV back into a minimal COCO JSON (images, annotations, categories), so we can use it later as the "ground truth" for evaluation.
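A possible sketch of these two utilities, assuming pandas and the column names listed above; the notebook's actual implementations may differ in details such as id assignment:

```python
import json
import os

import pandas as pd

def parse_coco(json_file, img_dir):
    """Flatten a COCO annotation JSON into one row per bounding box."""
    with open(json_file) as f:
        coco = json.load(f)
    images = {img["id"]: img["file_name"] for img in coco["images"]}
    cats = {c["id"]: c["name"] for c in coco["categories"]}
    rows = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, w, h]
        rows.append({
            "image_path": os.path.join(img_dir, images[ann["image_id"]]),
            "xmin": x, "ymin": y, "xmax": x + w, "ymax": y + h,
            "label": cats[ann["category_id"]],
        })
    df = pd.DataFrame(rows)
    df.to_csv("labels_raw.csv", index=False)
    return df

def build_coco_gt(df, out_json):
    """Convert the flat CSV back into a minimal COCO ground-truth JSON."""
    cat_ids = {n: i + 1 for i, n in enumerate(sorted(df["label"].unique()))}
    img_ids = {p: i + 1 for i, p in enumerate(sorted(df["image_path"].unique()))}
    anns = []
    for i, row in enumerate(df.itertuples(), start=1):
        w, h = row.xmax - row.xmin, row.ymax - row.ymin
        anns.append({"id": i, "image_id": img_ids[row.image_path],
                     "category_id": cat_ids[row.label],
                     "bbox": [row.xmin, row.ymin, w, h],
                     "area": w * h, "iscrowd": 0})
    gt = {"images": [{"id": i, "file_name": p} for p, i in img_ids.items()],
          "annotations": anns,
          "categories": [{"id": i, "name": n} for n, i in cat_ids.items()]}
    with open(out_json, "w") as f:
        json.dump(gt, f)
    return gt
```
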
  2. Custom Dataset + DataLoader

    • A FlowerDataset class (subclassing torch.utils.data.Dataset) whose __getitem__ loads an image, retrieves its boxes and labels from the CSV/COCO data, applies resizing, converts everything to tensors, and returns (image, target_dict) in the format TorchVision detection models expect.
  3. Active-Learning Loop

    • For each of ROUNDS cycles:
      1. Build DataLoaders for the current train_idx and test_idx.
      2. Train a fresh Faster R-CNN on the labeled subset.
      3. Evaluate on the validation set, record the mAP, and print it.
      4. Randomly sample POOL_BATCH new images from the pool and add them to train_idx.
  • After all rounds finish, I plot the number of labeled images against mAP@0.5 to see how performance scales as I label more data.
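The loop's bookkeeping can be sketched independently of the model. Here train_and_eval is a placeholder for the real "fit a fresh Faster R-CNN and run COCO evaluation" step, and the function name and defaults are illustrative, not the notebook's actual code:

```python
import random

def active_learning_loop(all_idx, test_idx, train_and_eval,
                         init_size=10, pool_batch=10, rounds=5, seed=0):
    """Random-sampling baseline: grow the labeled set each round and
    record mAP after retraining from scratch.

    train_and_eval(train_idx, test_idx) -> mAP is supplied by the caller.
    """
    rng = random.Random(seed)
    pool = [i for i in all_idx if i not in set(test_idx)]
    rng.shuffle(pool)
    train_idx, pool = pool[:init_size], pool[init_size:]
    history = []  # (num_labeled, mAP) pairs for the final plot
    for _ in range(rounds):
        m_ap = train_and_eval(train_idx, test_idx)
        history.append((len(train_idx), m_ap))
        # Random acquisition; swap in an uncertainty-based strategy
        # here for true active learning.
        new, pool = pool[:pool_batch], pool[pool_batch:]
        train_idx = train_idx + new
    return history
```
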

@naxatra2
Copy link
Contributor Author

naxatra2 commented Jul 3, 2025

@jveitchmichaelis @bw4sz could you please take a look at this repo?

This is made in reference to this comment from #1069:

Thanks @naxatra2, I think a good next step would be to demo a training + random sampling example, without label studio for now. I suggest you use images that we already have labels for (to avoid the human step). Let us know if you need any help putting that together.

It would be interesting to start totally from scratch e.g. a RetinaNet model that has been trained on MS-COCO. Use that to select your first X images, train and repeat. A nice outcome from the project would be a graph of number_of_images vs model_performance with different sampling strategies. Once we have your loop ready, those experiments should be easy to run on the UF cluster.

jveitchmichaelis added the "Google Summer of Code" label on Jul 3, 2025
jveitchmichaelis (Collaborator)

Thanks @naxatra2, some comments:

  • Good to see a self contained example.
  • My suggestion for these experiments is to always run a baseline training loop first, where you attempt to achieve a good score on the test set using the entire training dataset. Normally what you see in papers for active learning is that if you can use the entire dataset, you'll get the best/same results if you train long enough.
  • What pre-training options are you intending to use here? The weights parameter is current I think, rather than pretrained/pretrained_backbone (you also have some different initializations in the notebook with false/true, true/true).
  • We usually always start with something trained on MS-COCO or a similar backbone, at least unless the training dataset is 1000s of images. Oddly enough it looks like there isn't a flower class in COCO, which surprised me (there is "potted plant").
  • The mAP in your plots is still basically zero, so it looks like we're a long way from convergence.
  • You're training for a single epoch? I'm not sure what the literature recommends here, but this is very few iterations if your dataset is only 1-2 images at a time.

@bw4sz do you have a suggestion for which tree/box dataset we should start exploring to test this? Neon?

naxatra2 (Contributor, PR author) commented Jul 3, 2025

I was confused about the dataset part, so instead of choosing a heavier, better (possibly pretrained) model, I opted for a very light model that I could run without a GPU. I mostly did this to check whether my implemented logic was working. I think this causes most of the issues in my notebook, for example this part:

What pre-training options are you intending to use here? The weights parameter is current I think, rather than pretrained/pretrained_backbone (you also have some different initializations in the notebook with false/true, true/true).

This is mostly because I was experimenting with multiple models and forgot to clean up the code. I think this, together with the small dataset, explains my almost negligible mAP values.

Also, I initially thought of using the NEON dataset, but it was too big, and on GitHub I could only find the annotations, not the training images. So I used a very basic custom dataset to structure my notebook.

jveitchmichaelis (Collaborator)

If you're running in a Notebook, I would recommend using Google Colab for free GPU access (Nvidia T4). Disk space should be plenty on there too.
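A quick sanity check at the top of the notebook, using only standard PyTorch calls, confirms whether the Colab GPU was actually allocated:

```python
import torch

# Pick the GPU if Colab allocated one (e.g. a T4), else fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
if device.type == "cuda":
    print(torch.cuda.get_device_name(0))
```
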

naxatra2 commented Jul 5, 2025

Hi @bw4sz, in the last meeting you mentioned some demo code or a structure related to my model training. Could you please share it?
