Near duplicates in vessel detection train dataset #127

Open
@robmarkcole

I wrote a function to visualise the labels:

import json
import os

import matplotlib.pyplot as plt
from PIL import Image


def display_annotations(folder):
    """
    Display the RGB image with annotation points overlaid in red,
    using map bounds from metadata to compute correct pixel scale.
    Assumes bottom-left origin.
    """
    # Paths
    img_path = os.path.join(folder, 'layers', 'sentinel2', 'R_G_B', 'image.png')
    meta_path = os.path.join(folder, 'metadata.json')
    geojson_path = os.path.join(folder, 'layers', 'label', 'data.geojson')

    # Load image
    img = Image.open(img_path)
    W, H = img.size

    # Load spatial metadata
    with open(meta_path) as f:
        meta = json.load(f)
    xmin, ymin, xmax, ymax = meta['bounds']

    # Derive resolutions from bounds and image size
    x_res = (xmax - xmin) / float(W)    # map units per pixel X
    y_res = (ymax - ymin) / float(H)    # map units per pixel Y

    # Load annotations
    with open(geojson_path) as f:
        gj = json.load(f)
    features = gj.get('features', [])
    if not features:
        print("No features to display.")
        return
    print(f"Found {len(features)} features.")

    # Convert geo coords -> pixel coords
    xs, ys = [], []
    for feat in features:
        X_geo, Y_geo = feat['geometry']['coordinates']
        col = (X_geo - xmin) / x_res
        row = (Y_geo - ymin) / y_res
        xs.append(col)
        ys.append(row)

    # Plot with origin='lower'
    fig, ax = plt.subplots(figsize=(12, 12))
    ax.imshow(img, origin='lower')
    ax.scatter(xs, ys, s=2, marker='o', c='red') # edgecolors='k'
    ax.set_xlim(0, W)
    ax.set_ylim(0, H)
    ax.axis('off')
    plt.show()

I've observed multiple near duplicates in the training data, e.g.

  • 2062196_1236837_85862 # Found 1 features.
  • 2062196_1236837_86652 # Found 2 features.

They show the same scene, but the second has an additional feature, and the feature they share is slightly offset between the two. Is this due to slightly different AIS timings? Any thoughts on whether this negatively impacts training?
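To see how widespread this is, one option might be to group folder names by everything before the final underscore. This is only a sketch: it assumes the leading fields of the name identify the location, which is a guess based on the two names above, and `group_by_shared_prefix` and the third name in the example are hypothetical.

```python
from collections import defaultdict


def group_by_shared_prefix(names):
    """Group names by the text before the final underscore.

    Assumption (not confirmed): the leading fields identify the location,
    so names sharing them are candidate near duplicates.
    """
    groups = defaultdict(list)
    for name in names:
        prefix, sep, _suffix = name.rpartition("_")
        if not sep:
            # No underscore in the name, so there is no prefix to group on.
            continue
        groups[prefix].append(name)
    # Keep only prefixes shared by more than one name.
    return {p: sorted(ns) for p, ns in groups.items() if len(ns) > 1}


names = [
    "2062196_1236837_85862",
    "2062196_1236837_86652",
    "1000_2000_3",  # made-up name, included only to show a non-duplicate
]
candidates = group_by_shared_prefix(names)
print(candidates)
```

In practice `names` could come from `os.listdir` on the directory holding these folders. A shared prefix only flags candidates, and the images and labels would still need checking by eye.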

Image Image

And comparison of the labels:

Image
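The same comparison could be made numerically by pairing each label point in one folder with its nearest point in the other and reporting the offset. A minimal sketch, assuming the labels are point coordinates in the same units for both folders; `match_points`, the tolerance, and the example coordinates are all made up for illustration.

```python
import math


def match_points(points_a, points_b, tol):
    """Greedily pair each point in A with its nearest unused point in B.

    Returns (pairs, unmatched_a, unmatched_b), where each pair is
    (point_a, point_b, distance) and distance <= tol.
    """
    unused = list(points_b)
    pairs, unmatched_a = [], []
    for pa in points_a:
        if not unused:
            unmatched_a.append(pa)
            continue
        nearest = min(unused, key=lambda pb: math.dist(pa, pb))
        distance = math.dist(pa, nearest)
        if distance <= tol:
            pairs.append((pa, nearest, distance))
            unused.remove(nearest)
        else:
            unmatched_a.append(pa)
    return pairs, unmatched_a, unused


# Made-up coordinates: one shared feature offset by (3, 4), plus one extra.
labels_a = [(100.0, 200.0)]
labels_b = [(103.0, 204.0), (400.0, 50.0)]
pairs, only_a, only_b = match_points(labels_a, labels_b, tol=10.0)
print(pairs)   # matched features with their offset distance
print(only_b)  # features present only in the second folder
```

Greedy nearest-neighbour pairing is fine for a handful of points per scene. Denser scenes might need an optimal assignment instead.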
