Tips on data collection

Alessandra edited this page Jul 10, 2024 · 1 revision

KDL encourages the use of open data formats and standards.

The size and nature of the data, its diversity, provenance, and licensing, the compatibility of the input format with our software stack and existing workflows, and the degree of certainty, fuzziness, and interpretation in the data all affect the analysis KDL conducts and our recommendations on data collection. During feasibility assessment (see below), as well as after a project has started, in-depth conversations with partners will inform data collection.

However, if a project hasn’t started yet, there are some general rules you can follow when creating a dataset. While each research context requires bespoke analysis, here are some minimal tips KDL recommends you follow:

Separate any fields you would like to analyse (e.g. if surnames are a unit of analysis, make sure they are recorded in fields or columns separate from first names). As a general rule, use one value per field; if there are multiple values, place them in separate fields or in the comment field.

Be consistent with the format of the values in each field (e.g. dates). Please use a clear convention for ranges of uncertainty and apply it consistently. The same applies to the distinction between missing and ineligible values.

Minimise editorial interventions at the collection stage unless agreed (e.g. don’t modernise names if the analysis might later need to trace back to the original place or person names; instead, provide separate field(s) for modernised forms, to be filled in during data collection or at a later stage as appropriate).

If entity identification is relevant, create an identifier field to start matching records and to facilitate merging (e.g. to record whether people with the same name are the same person or different people).

Use notes and comments fields to record uncertainty or any comments on data collection that could be relevant for refining the data model later on (e.g. an indication of possible names to merge or to separate).

If relevant, make sure all records link with enough precision to their location (as locus of occurrence) in the source material.

If the plan is to analyse the dataset via the creation of maps, record geo-coordinates for known places if feasible.
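As an illustration only, the tips above could translate into a tabular layout like the one sketched below, together with a small consistency check. Every column name, the `[illegible]` and `[missing]` markers, and the earliest/latest date pair used to express uncertainty are hypothetical conventions chosen for this example, not a format prescribed by KDL; the actual conventions should be agreed with the project team.

```python
import csv
import io
import re

# Hypothetical layout: one value per column, original and modernised forms
# kept apart, an identifier for entity matching, a date range for
# uncertainty, coordinates, a source locus, and a free-text notes column.
SAMPLE = """\
record_id,first_name,surname_original,surname_modern,person_id,date_earliest,date_latest,place_original,latitude,longitude,source_locus,notes
r001,John,Smyth,,p001,1650-03-01,1650-03-31,Lundon,51.5074,-0.1278,MS 12 fol. 3r,Day of month not given in source
r002,Jon,Smith,,p001,1652-01-01,1652-12-31,[illegible],,,MS 12 fol. 7v,Possibly the same person as r001
r003,Mary,[missing],,p002,1652-06-14,1652-06-14,York,53.9600,-1.0873,MS 14 fol. 1r,
"""

# Assumed date convention for this sketch: ISO 8601 calendar dates.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def check_row(row):
    """Return a list of consistency problems found in one record."""
    problems = []
    for field in ("date_earliest", "date_latest"):
        if not ISO_DATE.match(row[field]):
            problems.append(f"{field} is not YYYY-MM-DD")
    # ISO dates in the same format compare correctly as plain strings.
    if not problems and row["date_earliest"] > row["date_latest"]:
        problems.append("date range is reversed")
    if not row["source_locus"]:
        problems.append("source_locus is empty")
    if bool(row["latitude"]) != bool(row["longitude"]):
        problems.append("latitude and longitude must be given together")
    return problems


def check_dataset(text):
    """Map each record_id to the problems found in that record."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["record_id"]: check_row(row) for row in reader}


report = check_dataset(SAMPLE)
for record_id, problems in report.items():
    print(record_id, problems or "ok")
```

In this sketch, an exact date is recorded by giving the same value in both date columns, and a date known only to the month or year is recorded as the full range it could fall in. Records r001 and r002 share a `person_id` while keeping their original spellings, so the merge decision stays reversible.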

See also:

W3C Data on the Web Best Practices