Improve unicode property tokens #56

Merged: 2 commits into master from improve_property_handling, Aug 28, 2018

jaynetics
Collaborator

Aims

  • make property tokens consistent
    • remove inconsistently applied abbreviations, reversals, omissions and substitutions
    • remove case-specific, redundant affixes such as 'script_'
      • the same information is available via the Token::Map structure and Parser emissions
  • make property tokens reverse-compatible, i.e. usable verbatim in /\p{#{token}}/
  • make them follow the standard property names (and thus googleable)
  • allow auto-generating a property map for the scanner
    • ensure support of all current properties
    • add support for future properties without major effort
  • allow automatically determining properties that are missing from the Token::Map
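
To illustrate the reverse-compatibility aim, here is a minimal sketch, not code from this PR. `valid_property_token?` is a hypothetical helper name; it only checks whether Ruby accepts a token when it is interpolated into `\p{...}`. The `u` flag is an assumption added here so the check does not depend on the source encoding.

```ruby
# Hypothetical helper, not part of regexp_parser: returns true if the token
# can be interpolated into \p{...} without Ruby rejecting the property name.
def valid_property_token?(token)
  regexp = /\p{#{token}}/u # raises RegexpError for unknown property names
  regexp.is_a?(Regexp)
rescue RegexpError
  false
end

# New-style tokens taken from the "new example" column of this PR.
new_style = %i[other_letter dash_punctuation thai white_space ascii_hex_digit]
# Old-style tokens that are not valid property names on their own.
old_style = %i[letter_other punct_dash]

new_style.each { |t| puts "#{t}: #{valid_property_token?(t)}" }
old_style.each { |t| puts "#{t}: #{valid_property_token?(t)}" }

# A new-style token can be used directly for matching (U+0E01 is a Thai letter).
puts(/\p{#{:thai}}/u.match?("\u0E01"))
```

The same kind of check could, in principle, be run over a full token list to spot names that Ruby does not accept.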

Changes to token names

| old pattern | new pattern     | old example             | new example             |
|-------------|-----------------|-------------------------|-------------------------|
| *_any       | *               | number_any              | number                  |
| *_cp        | *_code_point    | non_character_cp        | noncharacter_code_point |
| age_*_*     | age=*.*         | age_6_0                 | age=6.0                 |
| ascii_hex   | ascii_hex_digit | ascii_hex               | ascii_hex_digit         |
| block_in*   | in_*            | block_inadlam           | in_adlam                |
| *_extended  | *_extend        | other_grapheme_extended | other_grapheme_extend   |
| ids_*_op    | ids_*_operator  | ids_binary_op           | ids_binary_operator     |
| letter_*    | *_letter        | letter_other            | other_letter            |
| mark_*      | *_mark          | mark_other              | other_mark              |
| number_*    | *_number        | number_other            | other_number            |
| punct_*     | *_punctuation   | punct_dash              | dash_punctuation        |
| script_*    | *               | script_thai             | thai                    |
| separator_* | *_separator     | separator_other         | other_separator         |
| symbol_*    | *_symbol        | symbol_other            | other_symbol            |
| *para*      | *paragraph*     | separator_para          | paragraph_separator     |
| *whitespace | *white_space    | pattern_whitespace      | pattern_white_space     |

Codemod script

def normalize_old_rp_property_token(token)
  token
    .to_s
    .sub(/^age_(\d+)_(\d+)$/, 'age=\1.\2')
    .sub(/^ascii_hex$/, 'ascii_hex_digit')
    .sub(/^block_in/, 'in_')
    .sub(/^non_character/, 'noncharacter')
    .sub(/_cp$/, '_code_point')
    .sub(/^ids_(.+)_op$/, 'ids_\1_operator')
    .sub(/^other_grapheme_extended$/, 'other_grapheme_extend')
    .sub(/^punct_(.+)$/, '\1_punctuation')
    .sub(/^script_/, '')
    .sub(/^separator_para$/, 'paragraph_separator')
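    .sub(/^(letter|number|mark|separator|symbol)_(.+)$/, '\2_\1') # prefix -> suffix, e.g. letter_other -> other_letter
    .sub(/^any_/, '') # number_any became any_number above; drop 'any_' to get number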
    .sub(/whitespace$/, 'white_space')
    .to_sym
end
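
As a quick sanity check, the codemod can be run against the old/new example pairs from the table. The method is repeated here only so that the snippet runs on its own. The `EXAMPLES` hash is an illustrative name, and the pair for `other_grapheme_extended` follows the direction used by the codemod itself.

```ruby
# Same method as in the PR description, repeated so this snippet is self-contained.
def normalize_old_rp_property_token(token)
  token
    .to_s
    .sub(/^age_(\d+)_(\d+)$/, 'age=\1.\2')
    .sub(/^ascii_hex$/, 'ascii_hex_digit')
    .sub(/^block_in/, 'in_')
    .sub(/^non_character/, 'noncharacter')
    .sub(/_cp$/, '_code_point')
    .sub(/^ids_(.+)_op$/, 'ids_\1_operator')
    .sub(/^other_grapheme_extended$/, 'other_grapheme_extend')
    .sub(/^punct_(.+)$/, '\1_punctuation')
    .sub(/^script_/, '')
    .sub(/^separator_para$/, 'paragraph_separator')
    .sub(/^(letter|number|mark|separator|symbol)_(.+)$/, '\2_\1')
    .sub(/^any_/, '')
    .sub(/whitespace$/, 'white_space')
    .to_sym
end

# Old token => expected new token, taken from the examples in the table.
EXAMPLES = {
  number_any:              :number,
  non_character_cp:        :noncharacter_code_point,
  age_6_0:                 :'age=6.0',
  ascii_hex:               :ascii_hex_digit,
  block_inadlam:           :in_adlam,
  other_grapheme_extended: :other_grapheme_extend,
  ids_binary_op:           :ids_binary_operator,
  letter_other:            :other_letter,
  mark_other:              :other_mark,
  number_other:            :other_number,
  punct_dash:              :dash_punctuation,
  script_thai:             :thai,
  separator_other:         :other_separator,
  symbol_other:            :other_symbol,
  separator_para:          :paragraph_separator,
  pattern_whitespace:      :pattern_white_space
}.freeze

# Report any pair where the codemod output differs from the table.
mismatches = EXAMPLES.reject { |old, new| normalize_old_rp_property_token(old) == new }
puts mismatches.empty? ? 'all examples converted as expected' : mismatches.inspect
```

Note that the order of the `sub` calls matters: `number_any` is first turned into `any_number` by the prefix/suffix swap, and only then reduced to `number`.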

@ammar
Owner

ammar commented Aug 26, 2018

@janosch-x Sorry again for the very late response.

I like these changes a lot. Even if you decide to fork for future work and maintenance, I would like to merge this here too. The breaking changes are fairly easy to make. The users I know of, like twitter-cldr-rb (if they still do), fix the version they depend on, so there should be no surprises for them.

Thanks!

@jaynetics jaynetics merged commit fde1aa1 into master Aug 28, 2018
@jaynetics jaynetics deleted the improve_property_handling branch August 28, 2018 10:25