ElasticSearch index bugs #210

@sbearcsiro

The ElasticSearchService has two ways of indexing an image: as an individual record (ElasticSearchService.indexImage), and as a bulk index of all records in the database (ScheduleReindexAllImagesTask, using ImageService.exportIndexToFile and ElasticSearchService.bulkIndexImageInES). The ES documents these produce are inconsistent in the fields they provide, in the data types of some of those fields (e.g. in the bulk index all fields are strings, while the individual document has ints for width, height, etc.), and in the names of some fields (contentmd5hash, contentsha1hash in the bulk index vs contentMD5Hash, contentSHA1Hash in the individual index).

I suggest updating the bulk index to match the individual index in fields, field names and data types.
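
As a rough illustration of what "matching" could mean, here is a minimal Python sketch that maps a bulk-style document onto the indexImage names and types. The helper name `normalise_bulk_doc` and the three lookup tables are hypothetical and are derived only from the two example documents in this issue; the real change would presumably live in the Groovy export/bulk code instead.

```python
# Hypothetical helper, not part of image-service. The field lists below are
# assumptions taken from the two example documents in this issue.
RENAMES = {
    "contentmd5hash": "contentMD5Hash",
    "contentsha1hash": "contentSHA1Hash",
    "originalfilename": "originalFilename",
}
INT_FIELDS = {"fileSize", "height", "width", "zoomLevels", "thumbHeight", "thumbWidth"}
BOOL_FIELDS = {"harvestable"}


def normalise_bulk_doc(doc):
    """Return a copy of a bulk-index document using indexImage names and types."""
    out = {}
    for key, value in doc.items():
        key = RENAMES.get(key, key)  # fix the lower-cased field names
        if value is None:
            out[key] = None
        elif key in INT_FIELDS:
            out[key] = int(value)  # "1360" -> 1360
        elif key in BOOL_FIELDS:
            # "false" -> False; real booleans pass through unchanged
            out[key] = value if isinstance(value, bool) else str(value).strip().lower() == "true"
        else:
            out[key] = value
    return out
```

The sketch leaves dateUploaded and dateTaken untouched, although the examples below also show different formats there (date only in the bulk document, full timestamp in the individual one).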

Something something well defined schemas.
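
One possible shape for such a schema is an explicit Elasticsearch mapping rather than relying on dynamic mapping. The sketch below is written as a Python dict purely for illustration: the field types are guesses inferred from the indexImage example, the property list is abbreviated, and none of it is taken from the project's actual index settings.

```python
import json

# Assumed types, inferred from the indexImage example document in this issue.
# The property list is abbreviated; a real mapping would need every field.
IMAGE_MAPPING = {
    # "strict" makes Elasticsearch reject documents containing unmapped
    # field names (for example the lower-cased contentmd5hash).
    "dynamic": "strict",
    "properties": {
        "imageIdentifier": {"type": "keyword"},
        "contentMD5Hash": {"type": "keyword"},
        "contentSHA1Hash": {"type": "keyword"},
        "fileSize": {"type": "long"},
        "height": {"type": "integer"},
        "width": {"type": "integer"},
        "harvestable": {"type": "boolean"},
        "dateUploaded": {"type": "date"},
    },
}

# Body that could be sent when creating the index.
print(json.dumps({"mappings": IMAGE_MAPPING}, indent=2))
```

Note that a mapping alone would catch the field-name drift but not necessarily the string-vs-int drift, since Elasticsearch coerces numeric strings by default.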

Example indexImage document:

{
  "imageIdentifier" : "d7db130f-3416-430d-acb8-dca966f61a9e",
  "contentMD5Hash" : "97ab347bee9ed365963ea1eebd402e3c",
  "contentSHA1Hash" : "00e354b58a4c5e3baf5d8e69ef0ff823414410ec",
  "format" : "image/jpeg",
  "originalFilename" : "https://inaturalist-open-data.s3.amazonaws.com/photos/218415139/original.jpeg",
  "extension" : "jpeg",
  "dateUploaded" : "2022-08-01T10:41:56Z",
  "dateTaken" : "2022-08-01T10:41:56Z",
  "fileSize" : 785423,
  "height" : 2048,
  "width" : 1365,
  "zoomLevels" : 5,
  "dataResourceUid" : "dr1411",
  "creator" : "Grace Keast",
  "title" : null,
  "description" : null,
  "rights" : null,
  "rightsHolder" : "Grace Keast",
  "license" : "http://creativecommons.org/licenses/by-nc/4.0/",
  "thumbHeight" : 300,
  "thumbWidth" : 200,
  "harvestable" : false,
  "recognisedLicence" : "CC BY-NC 4.0",
  "occurrenceID" : null,
  "dateUploadedYearMonth" : "2022-08",
  "fileType" : "image",
  "imageSize" : "2m"
}

Example bulkIndexImageInES document:

{
  "imageIdentifier" : "277e29e6-eea0-454d-a81c-4d90d374a72a",
  "contentmd5hash" : "866ff2eeebf50518c2f25b19cdf7645a",
  "contentsha1hash" : "fc1706d73208d297dd83820132627a56312edb24",
  "format" : "image/jpeg",
  "originalfilename" : "https://static.inaturalist.org/photos/32546837/original.jpg",
  "extension" : "jpg?1552086948",
  "dateUploaded" : "2019-11-15",
  "dateTaken" : "2019-11-15",
  "fileSize" : "958021",
  "height" : "1360",
  "width" : "2048",
  "zoomLevels" : "5",
  "dataResourceUid" : "dr1411",
  "creator" : "Rolf Lawrenz",
  "rightsHolder" : "Rolf Lawrenz",
  "license" : "http://creativecommons.org/licenses/by/4.0/",
  "thumbHeight" : "199",
  "thumbWidth" : "300",
  "harvestable" : "false",
  "occurrenceID" : "4e48e22f-b9c6-494b-bb9d-0db9f621548b",
  "type" : "StillImage",
  "created" : "2019-03-06T12:36:50-08:00",
  "references" : "https://www.inaturalist.org/photos/32546837",
  "dateUploadedYearMonth" : "2019-11",
  "fileType" : "image",
  "recognisedLicence" : "unrecognised_licence",
  "imageSize" : "2m"
}
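
To make the differences between the two documents above easier to see (and to check later that they are gone), a small comparison along these lines might help. It is only a sketch: `compare_docs` is a hypothetical helper, and the two dicts are trimmed-down copies of the examples, not full documents.

```python
def compare_docs(individual, bulk):
    """Report key and type differences between two ES documents (dicts)."""
    only_individual = sorted(set(individual) - set(bulk))
    only_bulk = sorted(set(bulk) - set(individual))
    # Compare types only where both documents have a non-null value.
    type_mismatches = {
        key: (type(individual[key]).__name__, type(bulk[key]).__name__)
        for key in sorted(set(individual) & set(bulk))
        if individual[key] is not None
        and bulk[key] is not None
        and type(individual[key]) is not type(bulk[key])
    }
    return only_individual, only_bulk, type_mismatches


# Trimmed versions of the two example documents.
individual = {
    "contentMD5Hash": "97ab347bee9ed365963ea1eebd402e3c",
    "height": 2048,
    "harvestable": False,
    "title": None,
}
bulk = {
    "contentmd5hash": "866ff2eeebf50518c2f25b19cdf7645a",
    "height": "1360",
    "harvestable": "false",
    "type": "StillImage",
}

only_individual, only_bulk, type_mismatches = compare_docs(individual, bulk)
print("only in indexImage doc:", only_individual)
print("only in bulk doc:", only_bulk)
print("type mismatches:", type_mismatches)
```

Run against the full documents, the same function would also list fields such as title, description and rights (only in the individual document) and type, created and references (only in the bulk document).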
