Near duplicates in vessel detection train dataset

I wrote a function to visualise the labels:
```python
import json
import os

import matplotlib.pyplot as plt
from PIL import Image


def display_annotations(folder):
    """
    Display the RGB image with annotation points overlaid in red,
    using map bounds from metadata to compute correct pixel scale.
    Assumes bottom-left origin.
    """
    # Paths
    img_path = os.path.join(folder, 'layers', 'sentinel2', 'R_G_B', 'image.png')
    meta_path = os.path.join(folder, 'metadata.json')
    geojson_path = os.path.join(folder, 'layers', 'label', 'data.geojson')

    # Load image
    img = Image.open(img_path)
    W, H = img.size

    # Load spatial metadata
    with open(meta_path) as f:
        meta = json.load(f)
    xmin, ymin, xmax, ymax = meta['bounds']

    # Derive resolutions from bounds and image size
    x_res = (xmax - xmin) / float(W)    # map units per pixel X
    y_res = (ymax - ymin) / float(H)    # map units per pixel Y

    # Load annotations
    with open(geojson_path) as f:
        gj = json.load(f)
    features = gj.get('features', [])
    if not features:
        print("No features to display.")
        return
    print(f"Found {len(features)} features.")

    # Convert geo coords -> pixel coords
    xs, ys = [], []
    for feat in features:
        X_geo, Y_geo = feat['geometry']['coordinates']
        col = (X_geo - xmin) / x_res
        row = (Y_geo - ymin) / y_res
        xs.append(col)
        ys.append(row)

    # Plot with origin='lower'
    fig, ax = plt.subplots(figsize=(12, 12))
    ax.imshow(img, origin='lower')
    ax.scatter(xs, ys, s=2, marker='o', c='red') # edgecolors='k'
    ax.set_xlim(0, W)
    ax.set_ylim(0, H)
    ax.axis('off')
    plt.show()
```
I've observed multiple near-duplicate samples in the training data, e.g.:

- 2062196_1236837_85862 # Found 1 features.
- 2062196_1236837_86652 # Found 2 features.

They show the same scene, except that the second has an additional feature and the position of the first feature is slightly offset between the two. Could this be caused by slightly different AIS timestamps? Any thoughts on whether it negatively impacts training?
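
In case it helps with triage, here is a rough sketch for listing candidate near duplicates by folder name. It assumes the first two underscore-separated fields of the name identify the location, which I'm only inferring from the two IDs above. `group_by_prefix` is a hypothetical helper, not part of the dataset tooling.

```python
from collections import defaultdict


def group_by_prefix(sample_ids, n_fields=2):
    """Group sample IDs by their first `n_fields` underscore-separated fields.

    Assumption: a shared prefix means the samples cover the same location.
    """
    groups = defaultdict(list)
    for sample_id in sample_ids:
        prefix = "_".join(sample_id.split("_")[:n_fields])
        groups[prefix].append(sample_id)
    # Keep only prefixes shared by more than one sample
    return {p: sorted(ids) for p, ids in groups.items() if len(ids) > 1}


candidates = group_by_prefix([
    "2062196_1236837_85862",
    "2062196_1236837_86652",
])
print(candidates)
# {'2062196_1236837': ['2062196_1236837_85862', '2062196_1236837_86652']}
```

A shared prefix only flags candidates. Whether two samples really show the same scene still needs a visual check or an image comparison.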

<img width="646" alt="Image" src="https://github.com/user-attachments/assets/4fa6b4f5-7005-4dc5-8699-63d5b8cfddf8" />

<img width="636" alt="Image" src="https://github.com/user-attachments/assets/2100e462-d4d9-4576-9a2c-3179b6a12d17" />

And comparison of the labels:

<img width="1110" alt="Image" src="https://github.com/user-attachments/assets/0d3cb726-def4-43f6-9243-6b8f0d07ea72" />
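
To put a number on the offset, a minimal sketch like the one below could match each label in one sample to its nearest label in the other. `nearest_offsets` is a hypothetical helper. It expects plain `(x, y)` tuples, e.g. read from `feat['geometry']['coordinates']` in each `data.geojson`, and the coordinates in the example are illustrative only, not taken from the dataset.

```python
import math


def nearest_offsets(points_a, points_b):
    """For each (x, y) in points_a, return (index of nearest point in points_b, distance).

    Distances are in the same units as the input coordinates.
    Returns an empty list if points_b is empty.
    """
    if not points_b:
        return []
    results = []
    for xa, ya in points_a:
        dists = [math.hypot(xa - xb, ya - yb) for xb, yb in points_b]
        j = min(range(len(dists)), key=dists.__getitem__)
        results.append((j, dists[j]))
    return results


# Illustrative coordinates only, not taken from the dataset
labels_a = [(100.0, 200.0)]
labels_b = [(103.0, 204.0), (500.0, 500.0)]
print(nearest_offsets(labels_a, labels_b))  # [(0, 5.0)]
```

A small, consistent distance for the matched label would be in line with a timing difference, while unmatched labels (like the extra feature in the second sample) would show up as points in one sample that no point in the other maps to.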

Near duplicates in vessel detection train dataset #127
