Enhancing fix() #488

Open
anirudhgangwal opened this issue Jun 8, 2023 · 2 comments

@anirudhgangwal

I am implementing a Python version of this library for my own use case: https://github.com/anirudhgangwal/ukpostcodes. It mimics the functionality available here, including lookup against the ONS database (I don't use a DB or the postcodes.io API, just a set of ~1.8M postcodes).

We parse postcodes from OCR output, and "O" and "I" misreads account for almost all of our errors. The fix implemented here reduced our error rate significantly. However, I want to understand whether there was a reason not to expand this auto-correct further.

Let's take the example of a 3-character outcode. This can take the following forms:
A9A 9AA
A99 9AA
AA9 9AA

Because the second and third characters can each be either a letter or a number, this library currently coerces only the first character, using the pattern "L??".
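
To make the pattern notation concrete, here is a minimal sketch of a pattern-driven coercion. It assumes that "L" forces a letter, "N" forces a number, and "?" leaves the character untouched, and that only the O/0 and I/1 confusions are handled. The `coerce` function and the two lookup tables are illustrative and not necessarily this library's exact implementation.

```python
# Assumed OCR confusions: only O<->0 and I<->1, as described in this issue.
TO_LETTER = {"0": "O", "1": "I"}
TO_NUMBER = {"O": "0", "I": "1"}


def coerce(pattern: str, s: str) -> str:
    """Coerce each character of s according to the matching pattern character.

    'L' forces a letter, 'N' forces a number, '?' leaves the character alone.
    The pattern and the input are assumed to have the same length.
    """
    out = []
    for kind, ch in zip(pattern, s):
        if kind == "L":
            out.append(TO_LETTER.get(ch, ch))
        elif kind == "N":
            out.append(TO_NUMBER.get(ch, ch))
        else:
            out.append(ch)
    return "".join(out)


# With "L??" only the first character is corrected; the rest stay ambiguous.
print(coerce("L??", "0O1"))
# Fully specified patterns resolve every position.
for pattern in ("LNN", "LLN", "LNL"):
    print(pattern, coerce(pattern, "OOO"))
```

Under these assumptions "L??" never touches the second and third characters, which is why an input such as "OOO" comes back unchanged.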

I think it would be possible to add a new function, or a parameter to the existing function, that returns a list. E.g.

fix("OOO 4SS") => ["O00 4SS", "OO0 4SS", "O0O 4SS"] # try LNN, LLN, and LNL

A quick Python implementation looked like this:

import re
from typing import List

# FIXABLE_REGEX, coerce, coerce_outcode, coerce_incode and is_valid_outcode
# are assumed to be defined elsewhere in the library.


def fix_with_options(s: str) -> List[str]:
    """Attempts to fix a given postcode, covering all options.

    Args:
        s (str): The postcode to fix
    Returns:
        List[str]: All candidate fixed postcodes
    """
    if not FIXABLE_REGEX.match(s):
        # Not fixable: return the input unchanged, but still as a list
        return [s]
    # str.replace does not interpret regexes, so use re.sub to drop whitespace
    s = re.sub(r"\s+", "", s.upper())
    inward = s[-3:]
    outward = s[:-3]
    outcode_options = coerce_outcode_with_options(outward)
    return [
        f"{coerce_outcode(option)} {coerce_incode(inward)}"
        for option in outcode_options
    ]


def coerce_outcode_with_options(i: str) -> List[str]:
    """Coerce outcode, but cover all possibilities"""
    if len(i) == 2:
        return [coerce("LN", i)]
    elif len(i) == 3:
        outcodes = []
        if is_valid_outcode(outcode := coerce("LNL", i)):
            outcodes.append(outcode)
        if is_valid_outcode(outcode := coerce("LNN", i)):
            outcodes.append(outcode)
        if is_valid_outcode(outcode := coerce("LLN", i)):
            outcodes.append(outcode)
        # Remove duplicates while keeping a deterministic order
        return list(dict.fromkeys(outcodes))
    elif len(i) == 4:
        outcodes = []
        if is_valid_outcode(outcode := coerce("LLNL", i)):
            outcodes.append(outcode)
        if is_valid_outcode(outcode := coerce("LLNN", i)):
            outcodes.append(outcode)
        # Remove duplicates while keeping a deterministic order
        return list(dict.fromkeys(outcodes))
    else:
        return [i]

This reduced our error rate further (significantly, as most errors came from misreading 0). Note that this made sense for our use case because, after checking against the ONS directory, there were negligible false positives.
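
A minimal sketch of that directory check, assuming the candidates are compared against an in-memory set of known postcodes. The names `normalise`, `filter_known` and `KNOWN_POSTCODES` are hypothetical, and the postcodes in the set are made up for illustration; the real set would be loaded from the ONS directory.

```python
from typing import Iterable, List, Set

# Made-up stand-in for the ~1.8M postcode set; not real ONS data.
KNOWN_POSTCODES: Set[str] = {"OO04SS", "AB12CD"}


def normalise(postcode: str) -> str:
    """Upper-case and strip all whitespace so lookups ignore spacing."""
    return "".join(postcode.upper().split())


def filter_known(candidates: Iterable[str], known: Set[str]) -> List[str]:
    """Keep only the candidates that are present in the known-postcode set."""
    return [c for c in candidates if normalise(c) in known]


candidates = ["O00 4SS", "OO0 4SS", "O0O 4SS"]
print(filter_known(candidates, KNOWN_POSTCODES))
```

With a lookup like this, a list of candidates is cheap to disambiguate: any candidate that is not in the directory is simply dropped.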

@cblanc
Member

cblanc commented Jun 10, 2023

Thanks, we'll take a look. CC'ing @mfilip

@cblanc cblanc self-assigned this Jun 10, 2023
@mfilip
Member

mfilip commented Jun 21, 2023

Hey @anirudhgangwal, it is a nice approach, but to implement it in our lib we would need to break our interface pattern and return an array of possible fixes, which is not the intent of this simple lib. We see possible use cases for an array, but this lib is intended only to fix numeric mistakes and return a generally valid postcode.

A9A 9AA
A99 9AA
AA9 9AA

All of those are valid postcodes in their construction. Our lib only tries to fix inputs that do not match them, so the pattern L?? is sufficient to cover all of them. If your intent is to check the result against a DB afterwards, your version will give you fewer errors and additional possible fixes, which is great!
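
The point about construction can be sketched as follows, assuming a purely structural check that maps every letter to "A" and every digit to "9". The `shape` helper and `VALID_3CHAR_SHAPES` are hypothetical names used only for illustration. All three candidates from the issue pass such a check, so format alone cannot choose between them; a directory lookup is what disambiguates.

```python
# The three 3-character outcode forms listed in this thread.
VALID_3CHAR_SHAPES = {"A9A", "A99", "AA9"}


def shape(outcode: str) -> str:
    """Map letters to 'A' and digits to '9'; leave other characters as-is."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch
        for ch in outcode
    )


for candidate in ("O00", "OO0", "O0O"):
    # Each candidate has a different shape, and every shape is a valid form.
    print(candidate, shape(candidate), shape(candidate) in VALID_3CHAR_SHAPES)
```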
