Improve unicode property tokens #56

jaynetics · 2018-05-14T14:59:21Z

Aims

- make property tokens consistent
  - remove inconsistently applied abbreviations, reversals, omissions and substitutions
  - remove case-specific, redundant affixes such as 'script_'
    - the same information is available via the Token::Map structure and Parser emissions
- make property tokens reverse-compatible, i.e. compatible with /\p{#{token}}/
- make them standard-like (e.g. googleable)
- allow auto-generating a property map for the scanner
  - ensure support of all current properties
  - add support for future properties without major effort
- allow automatically determining properties that are missing from the Token::Map
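
To illustrate the reverse-compatibility aim, here is a minimal sketch of how one might check whether a token can be interpolated into /\p{...}/. The helper name reverse_compatible? is hypothetical and not part of this PR or the library; it only relies on Regexp.new raising RegexpError for property names the engine does not know.

```ruby
# Hypothetical helper (not part of this PR): returns true if Ruby's regex
# engine accepts the token as a property name, i.e. if /\p{token}/ compiles.
def reverse_compatible?(token)
  Regexp.new("\\p{#{token}}")
  true
rescue RegexpError
  false
end

# New-style names taken from the table in this PR.
new_tokens = %i[other_letter dash_punctuation thai white_space]
# Old-style names that the engine does not recognize as properties.
old_tokens = %i[letter_other punct_dash]

accepted = new_tokens.select { |t| reverse_compatible?(t) }
rejected = old_tokens.reject { |t| reverse_compatible?(t) }
```

With the renamed tokens, accepted should contain every new token, while the old spellings end up in rejected.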

Changes to token names

old pattern	new pattern	old example	new example
*_any	*	number_any	number
*_cp	*_code_point	non_character_cp	noncharacter_code_point
age_*_*	age=*.*	age_6_0	age=6.0
ascii_hex	ascii_hex_digit	ascii_hex	ascii_hex_digit
block_in*	in_*	block_inadlam	in_adlam
*_extended	*_extend	other_grapheme_extended	other_grapheme_extend
ids_*_op	ids_*_operator	ids_binary_op	ids_binary_operator
letter_*	*_letter	letter_other	other_letter
mark_*	*_mark	mark_other	other_mark
number_*	*_number	number_other	other_number
punct_*	*_punctuation	punct_dash	dash_punctuation
script_*	*	script_thai	thai
separator_*	*_separator	separator_other	other_separator
symbol_*	*_symbol	symbol_other	other_symbol
para	paragraph	separator_para	paragraph_separator
*whitespace	*white_space	pattern_whitespace	pattern_white_space

Codemod script

# Maps an old-style property token to its new name.
# The order of the substitutions matters: e.g. number_any first becomes
# any_number via the category reversal, and only then is the any_ prefix dropped.
def normalize_old_rp_property_token(token)
  token
    .to_s
    .sub(/^age_(\d+)_(\d+)$/, 'age=\1.\2')
    .sub(/^ascii_hex$/, 'ascii_hex_digit')
    .sub(/^block_in/, 'in_')
    .sub(/^non_character/, 'noncharacter')
    .sub(/_cp$/, '_code_point')
    .sub(/^ids_(.+)_op$/, 'ids_\1_operator')
    .sub(/^other_grapheme_extended$/, 'other_grapheme_extend')
    .sub(/^punct_(.+)$/, '\1_punctuation')
    .sub(/^script_/, '')
    .sub(/^separator_para$/, 'paragraph_separator')
    .sub(/^(letter|number|mark|separator|symbol)_(.+)$/, '\2_\1')
    .sub(/^any_/, '')
    .sub(/whitespace$/, 'white_space')
    .to_sym
end
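
As a sanity check, the codemod can be run against the example pairs from the table above. The following is only a sketch: the function is repeated so the snippet runs on its own, and the EXAMPLES constant and mismatches variable are names introduced here for illustration.

```ruby
# Copy of the codemod from this PR, repeated so the snippet is self-contained.
def normalize_old_rp_property_token(token)
  token
    .to_s
    .sub(/^age_(\d+)_(\d+)$/, 'age=\1.\2')
    .sub(/^ascii_hex$/, 'ascii_hex_digit')
    .sub(/^block_in/, 'in_')
    .sub(/^non_character/, 'noncharacter')
    .sub(/_cp$/, '_code_point')
    .sub(/^ids_(.+)_op$/, 'ids_\1_operator')
    .sub(/^other_grapheme_extended$/, 'other_grapheme_extend')
    .sub(/^punct_(.+)$/, '\1_punctuation')
    .sub(/^script_/, '')
    .sub(/^separator_para$/, 'paragraph_separator')
    .sub(/^(letter|number|mark|separator|symbol)_(.+)$/, '\2_\1')
    .sub(/^any_/, '')
    .sub(/whitespace$/, 'white_space')
    .to_sym
end

# Old => new example pairs, taken from the table of token name changes.
EXAMPLES = {
  number_any:              :number,
  non_character_cp:        :noncharacter_code_point,
  age_6_0:                 :'age=6.0',
  ascii_hex:               :ascii_hex_digit,
  block_inadlam:           :in_adlam,
  other_grapheme_extended: :other_grapheme_extend,
  ids_binary_op:           :ids_binary_operator,
  letter_other:            :other_letter,
  mark_other:              :other_mark,
  number_other:            :other_number,
  punct_dash:              :dash_punctuation,
  script_thai:             :thai,
  separator_other:         :other_separator,
  symbol_other:            :other_symbol,
  separator_para:          :paragraph_separator,
  pattern_whitespace:      :pattern_white_space
}.freeze

# Collect every pair where the codemod output differs from the table.
mismatches = EXAMPLES.reject do |old, new|
  normalize_old_rp_property_token(old) == new
end
```

An empty mismatches hash means the codemod agrees with every example row of the table.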

ammar · 2018-08-26T11:49:15Z

@janosch-x Sorry again for the very late response.

I like these changes a lot. Even if you decide to fork for future work and maintenance, I would like to merge this here too. The breaking changes are fairly easy to make. The users I know of, like twitter-cldr-rb (if they still do), fix the version they depend on, so there should be no surprises for them.

Thanks!

jaynetics added 2 commits May 14, 2018 16:54

Auto-generate property tokens, test and support all properties

781e34a

Prepare ChangeLog entry [ci skip]

1586f23

jaynetics merged commit fde1aa1 into master Aug 28, 2018

jaynetics deleted the improve_property_handling branch August 28, 2018 10:25
