Commit

Respect non-ASCII identifiers in sanitization for clearer names
See also Rust RFC 2457: rust-lang/rfcs#2457
evolutics committed May 24, 2021
1 parent 4ad26e5 commit 04e8fc0
Showing 5 changed files with 92 additions and 29 deletions.
8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,14 @@ Notable library changes are documented here in a format based on

## Unreleased

+### Changed
+
+- Respect Unicode identifiers in
+  [name sanitization](https://github.com/evolutics/iftree#name-sanitization).
+  If you only use ASCII file paths, then this change has no effect. Essentially,
+  non-ASCII characters that are valid in identifiers (from Rust 1.53.0) are
+  preserved instead of replaced by an underscore `"_"`.
+
## 0.1.1 – 2021-05-14

### Fixed

1 change: 1 addition & 0 deletions Cargo.toml
@@ -22,6 +22,7 @@ quote = "1.0"
serde = { version = "1.0", features = ["derive"] }
syn = { version = "1.0", features = ["default", "extra-traits"] }
toml = "0.5"
+unicode-xid = "0.2"

[dev-dependencies]
actix-web = "3.3"

27 changes: 19 additions & 8 deletions README.md
@@ -163,20 +163,31 @@ See

### Name sanitization

-When generating identifiers based on paths, names are sanitized as follows to
-ensure they are
-[valid identifiers](https://doc.rust-lang.org/reference/identifiers.html):
-
-- Characters other than ASCII alphanumericals are replaced by `"_"`.
-- If the first character is numeric, then `"_"` is prepended.
-- If the name is `"_"`, `"crate"`, `"self"`, `"Self"`, or `"super"`, then `"_"`
-  is appended.
+When generating identifiers based on paths, names are sanitized. For example, a
+folder name `.my-assets` is sanitized to an identifier `_my_assets`.
+
+The sanitization process is designed to generate valid
+[Unicode identifiers](https://doc.rust-lang.org/nightly/reference/identifiers.html).
+Essentially, it replaces invalid identifier characters by underscores `"_"`.
+More precisely:
+
+1. Characters without the property `XID_Continue` are replaced by `"_"`. The set
+   of `XID_Continue` characters in ASCII is `[0-9A-Z_a-z]`.
+1. Next, if the first character does not have the property `XID_Start`, then
+   `"_"` is prepended unless the first character is already `"_"`. The set of
+   `XID_Start` characters in ASCII is `[A-Za-z]`.
+1. Finally, if the name is `"_"`, `"crate"`, `"self"`, `"Self"`, or `"super"`,
+   then `"_"` is appended.
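
The three steps of the new sanitization can be sketched in plain Rust as follows. This is an illustrative sketch, not the crate's implementation: `sanitize`, `is_ascii_xid_start`, and `is_ascii_xid_continue` are hypothetical names, and the two predicates encode only the ASCII sets quoted in the README text. Full Unicode support would presumably swap in the predicates of the `unicode-xid` crate that this commit adds as a dependency.

```rust
/// ASCII-only stand-in for the `XID_Start` property: `[A-Za-z]`.
fn is_ascii_xid_start(character: char) -> bool {
    character.is_ascii_alphabetic()
}

/// ASCII-only stand-in for the `XID_Continue` property: `[0-9A-Z_a-z]`.
fn is_ascii_xid_continue(character: char) -> bool {
    character.is_ascii_alphanumeric() || character == '_'
}

/// Hypothetical helper applying the three sanitization steps in order.
fn sanitize(name: &str) -> String {
    // Step 1: replace every character without `XID_Continue` by an underscore.
    let replaced: String = name
        .chars()
        .map(|character| {
            if is_ascii_xid_continue(character) {
                character
            } else {
                '_'
            }
        })
        .collect();

    // Step 2: prepend an underscore unless the name already starts with an
    // `XID_Start` character or an underscore (an empty name gets one too).
    let prefixed = match replaced.chars().next() {
        Some(first) if is_ascii_xid_start(first) || first == '_' => replaced,
        _ => format!("_{}", replaced),
    };

    // Step 3: append an underscore to the names that need special handling.
    match prefixed.as_str() {
        "_" | "crate" | "self" | "Self" | "super" => format!("{}_", prefixed),
        _ => prefixed,
    }
}
```

With these ASCII-only predicates, `sanitize(".my-assets")` yields `_my_assets`, matching the folder example in the README text, and `sanitize("3")` yields `_3`.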

Names are further adjusted to respect naming conventions in the default case:

- Lowercase for folders (because they map to module names).
- Uppercase for filenames (because they map to static variables).
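
As a rough illustration of the case adjustment, the sketch below lowercases or uppercases an already sanitized name. It assumes that the standard library's `str::to_lowercase` and `str::to_uppercase` are an adequate model of the behavior. The `Convention` enum is redefined here with the variant names that appear in the crate's tests, and `apply_convention` is a hypothetical helper.

```rust
/// Redefined for this sketch; the variant names follow the crate's tests.
#[derive(Clone, Copy)]
enum Convention {
    SnakeCase,
    ScreamingSnakeCase,
}

/// Hypothetical helper: adjusts the case of an already sanitized name.
fn apply_convention(name: &str, convention: Convention) -> String {
    match convention {
        // Folders map to module names, hence lowercase.
        Convention::SnakeCase => name.to_lowercase(),
        // Files map to static variables, hence uppercase.
        Convention::ScreamingSnakeCase => name.to_uppercase(),
    }
}
```

Unicode case mapping can change the length of a name: uppercasing `ß` gives `SS`, which is consistent with the `README_SS_…` identifier expected by the updated test in this commit.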

+Note that non-ASCII identifiers are only supported from Rust 1.53.0. For earlier
+versions, the sanitization here may generate invalid identifiers if you use
+non-ASCII paths, in which case you need to manually rename the affected files.
+
### Portable file paths

To prevent issues when developing on different platforms, any paths in your

58 changes: 45 additions & 13 deletions src/generate_view/sanitize_name.rs
@@ -21,7 +21,7 @@ fn sanitize_by_convention(name: &str, convention: Convention) -> String {
 fn sanitize_special_characters(name: &str) -> String {
     name.chars()
         .map(|character| {
-            if character.is_ascii_alphanumeric() {
+            if unicode_xid::UnicodeXID::is_xid_continue(character) {
                 character
             } else {
                 '_'
@@ -32,14 +32,14 @@ fn sanitize_special_characters(name: &str) -> String {

 fn sanitize_first_character(name: String) -> String {
     match name.chars().next() {
-        Some(first_character) if first_character.is_numeric() => format!("_{}", name),
-        _ => name,
+        Some(first_character) if unicode_xid::UnicodeXID::is_xid_start(first_character) => name,
+        Some('_') => name,
+        _ => format!("_{}", name),
     }
 }

 fn sanitize_special_cases(name: String) -> String {
     match name.as_ref() {
-        "" => String::from("__"),
         "_" | "crate" | "self" | "Self" | "super" => format!("{}_", name),
         _ => name,
     }
@@ -60,33 +60,65 @@ mod tests {

     #[test]
     fn handles_convention_of_screaming_snake_case() {
-        let actual = main("README.md", Convention::ScreamingSnakeCase);
+        let actual = main("README_ß_ʼn.md", Convention::ScreamingSnakeCase);

-        let expected = quote::format_ident!("r#README_MD");
+        let expected = quote::format_ident!("r#README_SS_ʼN_MD");
         assert_eq!(actual, expected);
     }

     #[test]
     fn handles_convention_of_snake_case() {
-        let actual = main("README.md", Convention::SnakeCase);
+        let actual = main("README_ß_ʼn.md", Convention::SnakeCase);

-        let expected = quote::format_ident!("r#readme_md");
+        let expected = quote::format_ident!("r#readme_ß_ʼn_md");
         assert_eq!(actual, expected);
     }

     #[test]
     fn handles_special_characters() {
-        let actual = main("A B##C_D±EÅF𝟙G.H", Convention::ScreamingSnakeCase);
+        let actual = main("_0 1##2$3±4√5👽6.7", stubs::convention());
+
+        let expected = quote::format_ident!("r#_0_1__2_3_4_5_6_7");
+        assert_eq!(actual, expected);
+    }
+
+    #[test]
+    fn handles_non_ascii_identifiers() {
+        let actual = main("åb_π_𝟙", Convention::SnakeCase);
+
+        let expected = quote::format_ident!("r#åb_π_𝟙");
+        assert_eq!(actual, expected);
+    }
+
+    #[test]
+    fn handles_first_character_if_xid_start() {
+        let actual = main("a", Convention::SnakeCase);
+
+        let expected = quote::format_ident!("r#a");
+        assert_eq!(actual, expected);
+    }
+
+    #[test]
+    fn handles_first_character_if_underscore() {
+        let actual = main("_2", stubs::convention());
+
+        let expected = quote::format_ident!("r#_2");
+        assert_eq!(actual, expected);
+    }
+
+    #[test]
+    fn handles_first_character_if_xid_continue_but_not_xid_start() {
+        let actual = main("3", stubs::convention());

-        let expected = quote::format_ident!("r#A_B__C_D_E_F_G_H");
+        let expected = quote::format_ident!("r#_3");
         assert_eq!(actual, expected);
     }

     #[test]
-    fn handles_first_character() {
-        let actual = main("2a", Convention::SnakeCase);
+    fn handles_first_character_if_not_xid_continue() {
+        let actual = main(".4", stubs::convention());

-        let expected = quote::format_ident!("r#_2a");
+        let expected = quote::format_ident!("r#_4");
         assert_eq!(actual, expected);
     }


27 changes: 19 additions & 8 deletions src/lib.rs
@@ -161,20 +161,31 @@
//!
//! ## Name sanitization
//!
-//! When generating identifiers based on paths, names are sanitized as follows to
-//! ensure they are
-//! [valid identifiers](https://doc.rust-lang.org/reference/identifiers.html):
-//!
-//! - Characters other than ASCII alphanumericals are replaced by `"_"`.
-//! - If the first character is numeric, then `"_"` is prepended.
-//! - If the name is `"_"`, `"crate"`, `"self"`, `"Self"`, or `"super"`, then `"_"`
-//!   is appended.
+//! When generating identifiers based on paths, names are sanitized. For example, a
+//! folder name `.my-assets` is sanitized to an identifier `_my_assets`.
+//!
+//! The sanitization process is designed to generate valid
+//! [Unicode identifiers](https://doc.rust-lang.org/nightly/reference/identifiers.html).
+//! Essentially, it replaces invalid identifier characters by underscores `"_"`.
+//! More precisely:
+//!
+//! 1. Characters without the property `XID_Continue` are replaced by `"_"`. The set
+//!    of `XID_Continue` characters in ASCII is `[0-9A-Z_a-z]`.
+//! 1. Next, if the first character does not have the property `XID_Start`, then
+//!    `"_"` is prepended unless the first character is already `"_"`. The set of
+//!    `XID_Start` characters in ASCII is `[A-Za-z]`.
+//! 1. Finally, if the name is `"_"`, `"crate"`, `"self"`, `"Self"`, or `"super"`,
+//!    then `"_"` is appended.
//!
//! Names are further adjusted to respect naming conventions in the default case:
//!
//! - Lowercase for folders (because they map to module names).
//! - Uppercase for filenames (because they map to static variables).
//!
+//! Note that non-ASCII identifiers are only supported from Rust 1.53.0. For earlier
+//! versions, the sanitization here may generate invalid identifiers if you use
+//! non-ASCII paths, in which case you need to manually rename the affected files.
+//!
//! ## Portable file paths
//!
//! To prevent issues when developing on different platforms, any paths in your