[Enhancement] Change id hash map #5304

Merged
merged 6 commits into from
Feb 21, 2023

Conversation

peizhou001
Collaborator

@peizhou001 peizhou001 commented Feb 16, 2023

Description

The concurrent id hash map is mainly used in to_block to map an id array to a new contiguous one, but the current solution doesn't guarantee the mapping order. to_block requires that the first N unique seed nodes be mapped to [0, N). This PR changes the id hash map to meet that requirement.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@dgl-bot
Collaborator

dgl-bot commented Feb 16, 2023

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@peizhou001 peizhou001 changed the title Refactor id hash map [Enhancement]Refactor id hash map Feb 16, 2023
@peizhou001 peizhou001 self-assigned this Feb 16, 2023
@peizhou001 peizhou001 added the topic: system performance Issues about DGL system performance (e.g., speed, memory efficiency) label Feb 16, 2023
@peizhou001 peizhou001 changed the title [Enhancement]Refactor id hash map [Enhancement] Change id hash map Feb 16, 2023
@dgl-bot
Collaborator

dgl-bot commented Feb 16, 2023

Commit ID: abd5e5eaf3021519e5da17310e81a3a4b154def1

Build ID: 1

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 16, 2023

Commit ID: 85a12cc16f348f2f07fd360061bbf40ee051b6d5

Build ID: 2

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@peizhou001 peizhou001 marked this pull request as ready for review February 17, 2023 04:24
@dgl-bot
Collaborator

dgl-bot commented Feb 17, 2023

Commit ID: 78b467807c46bf3d72fa75a218b07118850e5afa

Build ID: 3

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 17, 2023

Commit ID: 172e7ef809b84a4d4710856d829476797709eb01

Build ID: 4

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

memset(hash_map_.get(), -1, sizeof(Mapping) * capacity);

// This code block is to fill the ids into hash_map_.
IdArray unique_ids = NewIdArray(num_ids, ctx, sizeof(IdType) * 8);
IdType* unique_ids_data = unique_ids.Ptr<IdType>();
// Fill in the first `num_seeds` ids.
parallel_for(0, num_seeds, kGrainSize, [&](int64_t s, int64_t e) {
Collaborator

@frozenbugs frozenbugs Feb 17, 2023

What's the common scale of num_seeds? Since kGrainSize is 256 already, do we need to use parallel?

Collaborator Author

@peizhou001 peizhou001 Feb 20, 2023

The scale depends on your fan-out; it is usually about 1/10 of the original nodes. When the number of input nodes is huge, it can also be very large. Running in parallel doesn't introduce side effects here, so keeping it should be better.
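To make the grain-size question concrete, here is a minimal sketch of how a grain-size-based loop might split a range into chunks. `PlanChunks` and its `max_threads` parameter are hypothetical names for illustration; this is an assumption about the general technique, not DGL's `parallel_for` implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of DGL): plans how [begin, end) would be
// split when each task should handle at least `grain_size` items and at
// most `max_threads` tasks may run.
std::vector<std::pair<int64_t, int64_t>> PlanChunks(
    int64_t begin, int64_t end, int64_t grain_size, int64_t max_threads) {
  assert(grain_size > 0 && max_threads > 0);
  std::vector<std::pair<int64_t, int64_t>> chunks;
  if (end <= begin) return chunks;
  const int64_t n = end - begin;
  // Number of tasks the grain size allows, rounded up.
  const int64_t by_grain = (n + grain_size - 1) / grain_size;
  const int64_t num_tasks = std::min(max_threads, by_grain);
  // Even split of the range across the chosen number of tasks.
  const int64_t chunk = (n + num_tasks - 1) / num_tasks;
  for (int64_t s = begin; s < end; s += chunk) {
    chunks.emplace_back(s, std::min(s + chunk, end));
  }
  return chunks;
}
```

Under these assumptions, a range no larger than one grain yields a single chunk, so a small `num_seeds` would run on one task, while a large `num_seeds` would be split across tasks.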

// Fill in the first `num_seeds` ids.
parallel_for(0, num_seeds, kGrainSize, [&](int64_t s, int64_t e) {
  for (int64_t i = s; i < e; i++) {
    InsertAndSet(ids_data[i], static_cast<IdType>(i));
Collaborator

Why don't we use AttemptInsertAt for seed ids?

Collaborator Author

  1. A seed id's mapped value is exactly its index in the array, so the key and value need to be set at the same time.
  2. Seed ids are unique, so the insertion is simpler and some checks can be removed to save effort.
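The two points above can be illustrated with a rough sketch of the two insertion paths. `TinyIdMap` is a hypothetical, simplified open-addressing table written for this note; the method names `InsertAndSet` and `Insert` mirror the thread, but their bodies here are assumptions, not the code in this PR. The sketch assumes non-negative ids, a power-of-two capacity larger than the number of unique ids, and lookups that happen only after all inserts have finished.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified table (not DGL's ConcurrentIdHashMap).
class TinyIdMap {
 public:
  static constexpr int64_t kEmpty = -1;

  explicit TinyIdMap(size_t capacity_pow2)
      : mask_(capacity_pow2 - 1), keys_(capacity_pow2), values_(capacity_pow2) {
    assert(capacity_pow2 > 0 && (capacity_pow2 & mask_) == 0);
    for (auto& k : keys_) k.store(kEmpty);
    for (auto& v : values_) v.store(kEmpty);
  }

  // For ids known to be unique (seed ids): claim the first free slot and
  // write the value right away. No "is this id already here?" check is needed.
  void InsertAndSet(int64_t id, int64_t value) {
    size_t pos = Hash(id);
    int64_t expected = kEmpty;
    while (!keys_[pos].compare_exchange_strong(expected, id)) {
      pos = (pos + 1) & mask_;  // slot owned by another id: probe onward
      expected = kEmpty;
    }
    values_[pos].store(value);
  }

  // For possibly duplicated ids: returns true only for the call that
  // inserts `id` first; the value is assumed to be assigned later.
  bool Insert(int64_t id) {
    size_t pos = Hash(id);
    while (true) {
      int64_t expected = kEmpty;
      if (keys_[pos].compare_exchange_strong(expected, id)) return true;
      if (expected == id) return false;  // already present
      pos = (pos + 1) & mask_;
    }
  }

  // Returns the mapped value, or kEmpty if absent or not yet assigned.
  int64_t Find(int64_t id) const {
    size_t pos = Hash(id);
    while (true) {
      const int64_t k = keys_[pos].load();
      if (k == id) return values_[pos].load();
      if (k == kEmpty) return kEmpty;
      pos = (pos + 1) & mask_;
    }
  }

 private:
  size_t Hash(int64_t id) const { return static_cast<size_t>(id) & mask_; }

  size_t mask_;
  std::vector<std::atomic<int64_t>> keys_;
  std::vector<std::atomic<int64_t>> values_;
};
```

In this sketch the seed path skips the duplicate check entirely, which is the saving the reply describes; the general path has to inspect the occupant of each probed slot.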

*
* For example, for an array A with following entries:
* [98, 98, 100, 99, 97, 99, 101, 100, 102]
* For example, for an array A having 4 seed ids with following entries:
Collaborator

Can you clarify what are the seed ids?

Collaborator Author

Sure, added.

* And then insert the items in `ids` concurrently to generate the
* mappings, in passing returning the unique ids in `ids`.
* @brief Initialize the hashmap with an array of ids. The first `num_seeds`
* ids are unqiue and must be mapped to a contiguous array starting
Collaborator

unqiue -> unique

Collaborator Author

fixed

* For example, for an array A with following entries:
* [98, 98, 100, 99, 97, 99, 101, 100, 102]
* For example, for an array A having 4 seed ids with following entries:
* [99, 98, 100, 97, 97, 101, 101, 102, 101]
Collaborator

In the comment below you mentioned that the num_seeds ids are unique. I am assuming the first 4 are seed ids, but I see a duplicated 97 in this example. Is this intended?

Collaborator Author

Yes. Seed ids are unique among themselves, but they can duplicate other ids. Putting this in the example may help make that clear to users.
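As a sanity check on the example array, the intended mapping can be sketched with a sequential reference. `MapSeedsFirst` is a hypothetical helper built on `std::unordered_map`, not the concurrent map from this PR. It assigns non-seed ids in first-occurrence order, whereas the doc comment under review states that the order of non-seed ids is not stable in the concurrent version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical sequential reference (not DGL code): maps the first
// `num_seeds` ids to [0, num_seeds) by position, then maps every other
// previously unseen id to the next free index.
std::unordered_map<int64_t, int64_t> MapSeedsFirst(
    const std::vector<int64_t>& ids, size_t num_seeds) {
  assert(num_seeds <= ids.size());
  std::unordered_map<int64_t, int64_t> mapping;
  for (size_t i = 0; i < num_seeds; ++i) {
    const bool inserted =
        mapping.emplace(ids[i], static_cast<int64_t>(i)).second;
    assert(inserted && "seed ids must be unique among themselves");
    (void)inserted;  // silence unused-variable warnings when NDEBUG is set
  }
  for (size_t i = num_seeds; i < ids.size(); ++i) {
    // emplace leaves the map unchanged when the id is already mapped,
    // whether it was a seed id or an earlier non-seed id.
    mapping.emplace(ids[i], static_cast<int64_t>(mapping.size()));
  }
  return mapping;
}
```

For `[99, 98, 100, 97, 97, 101, 101, 102, 101]` with 4 seed ids, this gives 99→0, 98→1, 100→2, 97→3, and the second 97 is simply a duplicate of a seed id; 101 and 102 take the remaining two indices, for 6 unique ids in total.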

* mapped to [0, num_seed_ids) and `left ids` to [num_seed_ids, num_unique_ids).
* Notice that mapping order is stable for `seed ids` while not for the left.
* divided into 2 parts: [`seed ids`, `left ids`]. `Seed ids` refer to
* a set ids chosen as the input for sampling process and `left ids` are the
Collaborator

@frozenbugs frozenbugs Feb 20, 2023

left ids -> sampled ids.

Sampled ids are the ids newly sampled by the process (note that the seed ids might be sampled in the process, but are not included in the sampled ids to avoid duplication).

Collaborator Author

@peizhou001 peizhou001 Feb 20, 2023

Good description; I've adopted it in the notes.
One small correction: seed ids can also be included in the sampled ids.

@dgl-bot
Collaborator

dgl-bot commented Feb 20, 2023

Commit ID: edeccf11f8f839f33bd319c64bad5a1f67a3645d

Build ID: 5

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Feb 20, 2023

Commit ID: cf9830eb1fc83755d827891f0248d97f04485319

Build ID: 6

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@@ -111,7 +111,9 @@ IdArray ConcurrentIdHashMap<IdType>::Init(
   parallel_for(num_seeds, num_ids, kGrainSize, [&](int64_t s, int64_t e) {
     size_t count = 0;
     for (int64_t i = s; i < e; i++) {
-      Insert(ids_data[i], &valid, i);
+      if (Insert(ids_data[i])) {
Collaborator

I am assuming each i will only be accessed once, so it can be simplified to:
valid[i] = Insert(ids_data[i]);
count += valid[i];

This is actually better, since the existing code at L107 assumes valid is initialized to 0, which might not be true for all C++ compilers.

Collaborator Author

That makes sense. Changed to this style.
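The simplification suggested above can be sketched in isolation. `MarkFirstOccurrences` is a hypothetical helper, and a `std::unordered_set` stands in for the hash map's `Insert` (returning true on first insertion); neither is taken from this PR. The point shown is that every slot in the range is written unconditionally, so the `valid` buffer does not need to be zero-initialized beforehand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not DGL code): marks first occurrences in
// ids[begin, end) and returns how many there were.
size_t MarkFirstOccurrences(const std::vector<int64_t>& ids, size_t begin,
                            size_t end, std::unordered_set<int64_t>* seen,
                            int8_t* valid) {
  assert(begin <= end && end <= ids.size());
  size_t count = 0;
  for (size_t i = begin; i < end; ++i) {
    // Unconditional assignment: each index is written exactly once,
    // so no earlier zero-fill of `valid` is relied upon.
    valid[i] = seen->insert(ids[i]).second;
    count += valid[i];
  }
  return count;
}
```

A caller could allocate `valid` with `new int8_t[n]`, which leaves the contents indeterminate, and still read correct flags for every index the loop visited.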

/**
* @brief The result state of an attempt to insert.
*/
enum class InsertState {
Collaborator

move to private section?

Collaborator Author

changed

@dgl-bot
Collaborator

dgl-bot commented Feb 21, 2023

Commit ID: e2976db

Build ID: 8

Status: ✅ CI test succeeded

Report path: link

Full logs path: link
Note: A new CI run will cancel previous CI runs, but an incorrect "success"
status might be shown for the previous runs. Please double check the report
before merging the PR.

@peizhou001 peizhou001 merged commit ed2e540 into dmlc:master Feb 21, 2023
@peizhou001 peizhou001 deleted the peizhou/changeidhashmap branch February 21, 2023 08:47
paoxiaode pushed a commit to paoxiaode/dgl that referenced this pull request Mar 24, 2023
* change concurrent id hash map
DominikaJedynak pushed a commit to DominikaJedynak/dgl that referenced this pull request Mar 12, 2024
* change concurrent id hash map
This pull request was closed.