Skip to content
/ TySug Public

A project around helping to prevent typing typos. TySug (Typo Suggestions) suggests alternative words with respect to keyboard layouts

License

Notifications You must be signed in to change notification settings

Dynom/TySug

Repository files navigation

TySug

CircleCI Go Report Card GoDoc codecov Mentioned in Awesome Go

TySug is collection of packages, together they form a keyboard layout aware alternative word suggester. It can be used as both a library and a webservice.

shcool

The primary supported use-case is to help with spelling mistakes against short popular word lists (e.g. domain names). Which is useful in helping to prevent typos in e.g. e-mail addresses, detect spam, phishing (Typosquatting), etc.

The goal is to provide an extensible library that helps with finding possible spelling errors. You can use it out-of-the-box as a library, a webservice or as a set of packages to build your own application.

Currently, it's a fairly naive approach and not (yet) backed by ML.

Using TySug

You can use TySug as stand-alone webservice to match against a known-list. If you have Docker you'll have it up and running in a few minutes.

TL;DR

If you have Docker installed, and you quickly want to tinker, just run:

docker run --rm -it dynom/tysug:latest

If you don't have Docker, you can download the binary from the releases page.

In a different terminal, run:

curl -s "http://127.0.0.1:1337/list/domains" --data-binary '{"input": "gmail.co"}'

As Webservice

curl -s "http://host:port/list/domains" --data-binary '{"input": "gmail.co"}' | jq .

{
  "result": "gmail.com",
  "score": 0.9777777777777777,
  "exact_match": false
}
  • The webservice uses Jaro-Winkler to calculate similarity.
  • The example uses jq, just omit it if you don't have it installed.

The path /list/< name >

The name corresponds with a list definition in the config.toml. Using this approach the service can be used for various types of data. This is both for efficiency (shorter lists to iterate over) and to be more opinionated. when no list by that name is found, a 404 is returned.

As a library

TySug is a collection of stand-alone packages. In each library you can find a README covering the details.

import "github.com/Dynom/TySug/finder"
referenceList := []string{"example", "amplifier", "ample"}
ts := finder.New(referenceList, finder.WithAlgorithm(myAlgorithm))

alt, score, exact := ts.Find("exampel")
// alt   = example
// score = 0.9714285714285714
// exact = false (not an exact match in our reference list)

Using a different algorithm

if you want to use a different algorithm, simply wrap your algorithm in a finder.Algorithm compatible type and pass it as argument to the Finder. You can find inspiration in the unit-tests / examples.

Possible considerations:

Sources:

Dealing with confidence

When adding your own algorithm, you'll need to handle the "confidence" element yourself. By default, TySug's finder will handle it just fine, but depending on the scale the algorithm uses you'll need to either normalize the scale or deal with the score.

Note: Be careful not to introduce bias when converting scale.

var someAlgorithm finder.AlgWrapper = func(a, b string) float64 {

    // Result is, in this instance, the amount of steps taken to achieve equality
    // Algorithms like Jaro produce a value between 0.0 and 1.0
    score := someAlgorithm.CalculateDistance(a, b)
    
    // Finding the longest string
    var ml = len(b)
    if len(a) >= len(b) {
        ml = len(a)
    }
    
    // This introduces a bias. Inputs of longer lengths get a slight favour over shorter ones, causing deletions to weigh less.
    return 1 - (score / float64(ml))
}

sug := finder.New([]list, finder.WithAlgorithm(someAlgorithm))
bestMatch, score := sug.Find(input)
// Here score might be 0.8 for a string of length 10, with 2 mutations

Without converting the scale, you'll have no bias, however you need to deal with a range where closer to 0 means fewer changes:

// This will produce a range from (-1 * maximumInputLength) to 0
return -1 * score

Details

Reference lists

The reference list is a list with known/approved words. TySug's webservice is not optimised to deal with large lists, instead it aims for "opinionated" lists. This way you can have a list of domain names or country names. This keeps the service snappy and less prone to false-positives.

Large is relative. The size is strongly related to the processing time, longer lists take more time O(N). Test and keep the list within your response-time limits :-).

Case-sensitivity

TySug does not normalise words. This means that words are treated in a case-sensitive matter. This is done mostly to avoid doing unnecessary work in the hot-path. Typically, you'll want to make sure both your lists and your input uses the same casing.

Ordering

The reference list order is significant. The first of an equal score wins the election. So you'll want to put more common, popular, etc. words first in the list.

Keyboard layout awareness