active learning loop workflow on custom dataset #1087


Open · wants to merge 1 commit into main

Conversation

naxatra2 (Contributor) commented Jul 3, 2025

I have created a new notebook that is not linked to DeepForest in any way; I just wrote my code in the same directory for convenience. The notebook checks whether my model is learning the way I want it to.

I used a custom dataset of daisy flowers: 131 annotated images in COCO format. To reproduce this code you need a training image dataset with annotations in either .json or .csv format, plus a test dataset.

I used a very light model for training to reduce runtime. Replacing it with a stronger model should improve accuracy.

Objective

Simulate how an object detector's accuracy (mAP) improves as more images are iteratively labeled. This example currently uses random sampling from the unlabeled pool, which I will take as the baseline; the next step is to use an active-learning-specific sampling technique.

How to reproduce this (without shipping the giant flower dataset)

  1. Prepare your own dataset in COCO format (or convert from Pascal VOC or CSV to COCO).

    • You need a JSON with images, annotations, and categories keys. Each annotation must have image_id, a bbox in [x, y, w, h] format, and category_id.
    • Put your image files in two folders (one for training + pool, one for the fixed test/val split).
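For concreteness, a minimal COCO annotation file with exactly those three keys might look like this (the file name, ids, image sizes, and the "daisy" category are illustrative placeholders, not taken from the actual dataset):

```python
import json

# Minimal COCO-format annotation file with the three required keys.
coco = {
    "images": [
        {"id": 1, "file_name": "daisy_001.jpg", "width": 640, "height": 480},
    ],
    "annotations": [
        # bbox is [x, y, w, h] in pixels; each box references an image_id.
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [120, 80, 60, 55], "area": 60 * 55, "iscrowd": 0},
    ],
    "categories": [
        {"id": 1, "name": "daisy"},
    ],
}

with open("annotations.json", "w") as f:
    json.dump(coco, f, indent=2)
```
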

The workflow has three main steps

  1. COCO ↔ CSV conversion

    • The parse_coco(json_file, img_dir) function reads a COCO-style annotation JSON and writes out a flat labels_raw.csv, with one row per bounding box (xmin, ymin, xmax, ymax, label, image_path).
    • The build_coco_gt(df, out_json) utility converts that CSV back into a minimal COCO JSON (images, annotations, categories), so we can use it later as the "ground truth" for evaluation.
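A possible sketch of these two utilities, assuming pandas and the column names listed above; the notebook's actual implementations may differ in details such as id assignment:

```python
import json
import os

import pandas as pd

def parse_coco(json_file, img_dir):
    """Flatten a COCO annotation JSON into one row per bounding box."""
    with open(json_file) as f:
        coco = json.load(f)
    images = {img["id"]: img["file_name"] for img in coco["images"]}
    cats = {c["id"]: c["name"] for c in coco["categories"]}
    rows = []
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]  # COCO boxes are [x, y, w, h]
        rows.append({
            "image_path": os.path.join(img_dir, images[ann["image_id"]]),
            "xmin": x, "ymin": y, "xmax": x + w, "ymax": y + h,
            "label": cats[ann["category_id"]],
        })
    df = pd.DataFrame(rows)
    df.to_csv("labels_raw.csv", index=False)
    return df

def build_coco_gt(df, out_json):
    """Convert the flat CSV back into a minimal COCO ground-truth JSON."""
    cat_ids = {n: i + 1 for i, n in enumerate(sorted(df["label"].unique()))}
    img_ids = {p: i + 1 for i, p in enumerate(sorted(df["image_path"].unique()))}
    anns = []
    for i, row in enumerate(df.itertuples(), start=1):
        w, h = row.xmax - row.xmin, row.ymax - row.ymin
        anns.append({"id": i, "image_id": img_ids[row.image_path],
                     "category_id": cat_ids[row.label],
                     "bbox": [row.xmin, row.ymin, w, h],
                     "area": w * h, "iscrowd": 0})
    gt = {"images": [{"id": i, "file_name": p} for p, i in img_ids.items()],
          "annotations": anns,
          "categories": [{"id": i, "name": n} for n, i in cat_ids.items()]}
    with open(out_json, "w") as f:
        json.dump(gt, f)
    return gt
```
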
  2. Custom Dataset + DataLoader

    • A FlowerDataset class (subclassing torch.utils.data.Dataset) whose __getitem__ loads an image, retrieves its boxes and labels from the CSV/COCO data, applies resizing, converts everything to tensors, and returns (image, target_dict) in the format TorchVision detection models expect.
  3. Active-Learning Loop

    • For each of ROUNDS cycles:
      1. Build DataLoaders for the current train_idx and test_idx.
      2. Train a fresh Faster R-CNN on the labeled subset.
      3. Evaluate on the validation set, record the mAP, and print it.
      4. Randomly sample POOL_BATCH new images from the pool and add them to train_idx.
  • After all rounds finish, I plot the number of labeled images against mAP@0.5 to see how performance scales as I label more data.
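The loop's bookkeeping can be sketched independently of the model. Here train_and_eval is a placeholder for the real "fit a fresh Faster R-CNN and run COCO evaluation" step, and the function name and defaults are illustrative, not the notebook's actual code:

```python
import random

def active_learning_loop(all_idx, test_idx, train_and_eval,
                         init_size=10, pool_batch=10, rounds=5, seed=0):
    """Random-sampling baseline: grow the labeled set each round and
    record mAP after retraining from scratch.

    train_and_eval(train_idx, test_idx) -> mAP is supplied by the caller.
    """
    rng = random.Random(seed)
    pool = [i for i in all_idx if i not in set(test_idx)]
    rng.shuffle(pool)
    train_idx, pool = pool[:init_size], pool[init_size:]
    history = []  # (num_labeled, mAP) pairs for the final plot
    for _ in range(rounds):
        m_ap = train_and_eval(train_idx, test_idx)
        history.append((len(train_idx), m_ap))
        # Random acquisition; swap in an uncertainty-based strategy
        # here for true active learning.
        new, pool = pool[:pool_batch], pool[pool_batch:]
        train_idx = train_idx + new
    return history
```
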

@naxatra2
Copy link
Contributor Author

naxatra2 commented Jul 3, 2025

@jveitchmichaelis @bw4sz could you please take a look at this repo?

This is made in reference to this comment from #1069:

Thanks @naxatra2, I think a good next step would be to demo a training + random sampling example, without label studio for now. I suggest you use images that we already have labels for (to avoid the human step). Let us know if you need any help putting that together.

It would be interesting to start totally from scratch e.g. a RetinaNet model that has been trained on MS-COCO. Use that to select your first X images, train and repeat. A nice outcome from the project would be a graph of number_of_images vs model_performance with different sampling strategies. Once we have your loop ready, those experiments should be easy to run on the UF cluster.

jveitchmichaelis added the "Google Summer of Code" label on Jul 3, 2025
jveitchmichaelis (Collaborator)

Thanks @naxatra2, some comments:

  • Good to see a self contained example.
  • My suggestion for these experiments is to always run a baseline training loop first, where you attempt to achieve a good score on the test set using the entire training dataset. Normally what you see in papers for active learning is that if you can use the entire dataset, you'll get the best/same results if you train long enough.
  • What pre-training options are you intending to use here? The weights parameter is current I think, rather than pretrained/pretrained_backbone (you also have some different initializations in the notebook with false/true, true/true).
  • We usually always start with something trained on MS-COCO or a similar backbone, at least unless the training dataset is 1000s of images. Oddly enough it looks like there isn't a flower class in COCO, which surprised me (there is "potted plant").
  • The mAP in your plots is still basically zero, so it looks like we're a long way from convergence.
  • You're training for a single epoch? I'm not sure what the literature recommends here, but this is very few iterations if your dataset is only 1-2 images at a time.

@bw4sz do you have a suggestion for which tree/box dataset we should start exploring to test this? Neon?

naxatra2 (Contributor, PR author) commented Jul 3, 2025

I was confused about the dataset part, so instead of choosing a heavier, better (possibly pretrained) model, I opted for a very light model that I could run without a GPU. I mostly did this to check whether my implemented logic was working. I think this causes most of the issues in my notebook, for example this part:

What pre-training options are you intending to use here? The weights parameter is current I think, rather than pretrained/pretrained_backbone (you also have some different initializations in the notebook with false/true, true/true).

This is mostly because I was experimenting with multiple models and forgot to clean up the code. I think this, together with the small dataset, explains my almost negligible mAP values.

Also, I initially thought of using the NEON dataset, but it was too big, and on GitHub I could only find the annotations, not the training images. So I used a very basic custom dataset to structure my notebook.

jveitchmichaelis (Collaborator)

If you're running in a Notebook, I would recommend using Google Colab for free GPU access (Nvidia T4). Disk space should be plenty on there too.
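A quick sanity check at the top of the notebook, using only standard PyTorch calls, confirms whether the Colab GPU was actually allocated:

```python
import torch

# Pick the GPU if Colab allocated one (e.g. a T4), else fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
if device.type == "cuda":
    print(torch.cuda.get_device_name(0))
```
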

naxatra2 commented Jul 5, 2025

Hi @bw4sz, in the last meeting you mentioned some demo code or a structure related to my model training. Could you please share it?
