Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wasm] Enable ICU sharding #51665

Closed
wants to merge 19 commits into from
Closed

[wasm] Enable ICU sharding #51665

wants to merge 19 commits into from

Conversation

tqiu8
Copy link
Contributor

@tqiu8 tqiu8 commented Apr 22, 2021

Initial work to enable ICU sharding according to locale and features.

  • Added --enable-sharding flag in runtime-test.js to toggle feature on and off.
  • Test suites currently do not run with sharding enabled.
  • Added switches between setAppData and setCommonData, which allows shards like icudt_CJK.dat and icudt_EFIGS.dat to be loaded simultaneously (needs more investigating).

Current size differences:

Current Data Files Size (Compressed) Proposed Shard Files Aggregate Sizes (Compressed)
icudt.dat 384 icudt_base.dat
icudt_currency.dat
icudt_normalization.dat
icudt_coll.dat
icudt_locales.dat
368
icudt_EFIGS.dat 192 icudt_base.dat
icudt_efigs_locales.dat
icudt_currency.dat
icudt_normalization.dat
icudt_efigs_coll.dat
172
icudt_CJK.dat 256 icudt_base.dat
icudt_cjk_locales.dat
icudt_currency.dat
icudt_normalization.dat
icudt_cjk_coll.dat
304
icudt_no_CJK.dat 256 icudt_base.dat
icudt_no_cjk_locales.dat
icudt_currency.dat
icudt_normalization.dat
icudt_no_cjk_coll.dat
252

We see size gains in EFIGS in no_CJK, but size increases in CJK due to increased size of collation data.

Currently references icu_dictionary.json which is a mapping between culture and relevant files (generated by in dotnet/icu#104).

Related to #49220 and #49221

Additional documentation can be found here.

@ghost
Copy link

ghost commented Apr 22, 2021

Tagging subscribers to this area: @tarekgh, @safern
See info in area-owners.md if you want to be subscribed.

Issue Details

Initial work to enable ICU sharding according to locale and features.

  • Added --enable-sharding flag in runtime-test.js to toggle feature on and off.
  • Test suites currently do not run with sharding enabled.
  • Added switches between setAppData and setCommonData, which allows shards like icudt_CJK.dat and icudt_EFIGS.dat to be loaded simultaneously (needs more investigating).

Current size differences:

Current Data Files Size (Compressed) Proposed Shard Files Aggregate Sizes (Compressed)
icudt.dat 384 icudt_base.dat
icudt_currency.dat
icudt_normalization.dat
icudt_coll.dat
icudt_locales.dat
368
icudt_EFIGS.dat 192 icudt_base.dat
icudt_efigs_locales.dat
icudt_currency.dat
icudt_normalization.dat
icudt_efigs_coll.dat
172
icudt_CJK.dat 256 icudt_base.dat
icudt_cjk_locales.dat
icudt_currency.dat
icudt_normalization.dat
icudt_cjk_coll.dat
304
icudt_no_CJK.dat 256 icudt_base.dat
icudt_no_cjk_locales.dat
icudt_currency.dat
icudt_normalization.dat
icudt_no_cjk_coll.dat
252

We see size gains in EFIGS in no_CJK, but size increases in CJK due to increased size of collation data.

Currently references icu_dictionary.json which is a mapping between culture and relevant files (generated by in dotnet/icu#).

Related to #49220 and #49221

Additional documentation can be found here.

Author: tqiu8
Assignees: -
Labels:

area-System.Globalization

Milestone: -

@tqiu8 tqiu8 mentioned this pull request Apr 22, 2021
5 tasks
@@ -146,7 +154,7 @@ int32_t GlobalizationNative_LoadICUData(const char* path)

fclose(fp);

if (load_icu_data(icu_data) == 0) {
if (load_icu_data(icu_data, strcasecmp("icudt.dat", path)) == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this always be named icudt.dat? I think in some cases the platform native .dat file might have a version number in its path.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should rename all the shards to something like .app.dat and assume anything that doesn't match that pattern should use setCommonData?

src/mono/wasm/build/WasmApp.targets Outdated Show resolved Hide resolved
src/tasks/WasmAppBuilder/WasmAppBuilder.cs Outdated Show resolved Hide resolved
src/mono/wasm/build/WasmApp.targets Outdated Show resolved Hide resolved
src/tasks/WasmAppBuilder/WasmAppBuilder.cs Outdated Show resolved Hide resolved
src/tasks/WasmAppBuilder/WasmAppBuilder.cs Outdated Show resolved Hide resolved
@marek-safar
Copy link
Contributor

Could you add test(s) that the feature works as expected?

@tarekgh tarekgh requested a review from safern April 22, 2021 15:51
@CoffeeFlux
Copy link
Contributor

Do you know why the size of the collation data is increasing? Also, how hard would it be to split up locales further as a followup (and how much potential size would we save by doing so)?

{

UErrorCode status = 0;
udata_setCommonData(pData, &status);
if (type == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give this enum some names? That way it reads better.

@tqiu8
Copy link
Contributor Author

tqiu8 commented Apr 23, 2021

Do you know why the size of the collation data is increasing? Also, how hard would it be to split up locales further as a followup (and how much potential size would we save by doing so)?

Not completely sure....Splitting up the collation by locales on the EFIGS does not make a huge difference, but might make a difference in CJK.

Copy link
Member

@lewing lewing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move the specifics of the file names and the shard resolution javascript to dotnet/icu (see my comment there) and then just import the props and .js here?

Copy link
Member

@lewing lewing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move the specifics of the file names and the shard resolution javascript to dotnet/icu (see my comment there) and then just import the props and .js here?

@@ -146,7 +154,7 @@ int32_t GlobalizationNative_LoadICUData(const char* path)

fclose(fp);

if (load_icu_data(icu_data) == 0) {
if (load_icu_data(icu_data, strcasecmp("icudt.dat", path)) == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should rename all the shards to something like .app.dat and assume anything that doesn't match that pattern should use setCommonData?

@maryamariyan maryamariyan added the arch-wasm WebAssembly architecture label May 20, 2021
@ghost
Copy link

ghost commented May 20, 2021

Tagging subscribers to 'arch-wasm': @lewing
See info in area-owners.md if you want to be subscribed.

Issue Details

Initial work to enable ICU sharding according to locale and features.

  • Added --enable-sharding flag in runtime-test.js to toggle feature on and off.
  • Test suites currently do not run with sharding enabled.
  • Added switches between setAppData and setCommonData, which allows shards like icudt_CJK.dat and icudt_EFIGS.dat to be loaded simultaneously (needs more investigating).

Current size differences:

Current Data Files Size (Compressed) Proposed Shard Files Aggregate Sizes (Compressed)
icudt.dat 384 icudt_base.dat
icudt_currency.dat
icudt_normalization.dat
icudt_coll.dat
icudt_locales.dat
368
icudt_EFIGS.dat 192 icudt_base.dat
icudt_efigs_locales.dat
icudt_currency.dat
icudt_normalization.dat
icudt_efigs_coll.dat
172
icudt_CJK.dat 256 icudt_base.dat
icudt_cjk_locales.dat
icudt_currency.dat
icudt_normalization.dat
icudt_cjk_coll.dat
304
icudt_no_CJK.dat 256 icudt_base.dat
icudt_no_cjk_locales.dat
icudt_currency.dat
icudt_normalization.dat
icudt_no_cjk_coll.dat
252

We see size gains in EFIGS in no_CJK, but size increases in CJK due to increased size of collation data.

Currently references icu_dictionary.json which is a mapping between culture and relevant files (generated by in dotnet/icu#104).

Related to #49220 and #49221

Additional documentation can be found here.

Author: tqiu8
Assignees: -
Labels:

arch-wasm, area-System.Globalization

Milestone: -

@@ -178,7 +178,7 @@ run-tests-%:
EMSDK_PATH=$(EMSDK_PATH) PATH="$(JSVU):$(PATH)" $(DOTNET) build $(TOP)/src/libraries/$*/tests/ /t:Test $(_MSBUILD_WASM_BUILD_ARGS) $(MSBUILD_ARGS)

run-build-tests:
PATH="$(JSVU):$(PATH)" $(DOTNET) build $(TOP)/src/tests/BuildWasmApps/Wasm.Build.Tests/ /t:Test $(_MSBUILD_WASM_BUILD_ARGS) $(MSBUILD_ARGS)
PATH="$(JSVU):$(PATH)" $(DOTNET) build $(TOP)/src/tests/BuildWasmApps/Wasm.Build.Tests/ /t:Test $(_MSBUILD_WASM_BUILD_ARGS) $(MSBUILD_ARGS) /p:XUnitMethodName=Wasm.Build.Tests.InvariantGlobalizationTests.Invariant_WithSharding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an intentional change to check in?

@safern
Copy link
Member

safern commented Jun 17, 2021

@tqiu8 What is needed to move this PR forward?

@lewing
Copy link
Member

lewing commented Jun 17, 2021

@safern it is blocked on me at the moment, @tqiu8 has moved on to greener pastures.

@lewing lewing self-assigned this Jun 17, 2021
@maryamariyan
Copy link
Member

@lewing the PR seems to have conflicts, I'll close the PR.

@ghost ghost locked as resolved and limited conversation to collaborators Aug 7, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants