Add documentation for GET_HOST_CATEGORIES function #6

Merged
7 commits merged on Jul 8, 2024

Conversation

nrllh
Contributor

@nrllh nrllh commented Jul 1, 2024

This pull request adds documentation for the GET_HOST_CATEGORIES function, including its description, usage, and methodology.

Owner

@rviscomi rviscomi left a comment

Thanks so much for documenting this! Just a few small comments.

# `GET_HOST_CATEGORIES` function

## Description
We are happy to release the largest open-source dataset of website categories, featuring 147 million hosts and 31 million domains, making it the most extensive open-source data available in this area.
Owner

WDYT about keeping the language more neutral for the technical reference docs? For a blog post style announcement we can also create a new page under the guides dir.

Contributor Author

I was also not sure about that. Feel free to remove this sentence.

### Raw Data
The raw data for the classifications is stored in the `httparchive.urls.categories` table. This table consists of pre-classified hostnames with their corresponding categories. The categories follow a hierarchical structure, providing both specific subcategories and broader parent categories.
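As a rough illustration of the hierarchical structure described above, the sketch below expands one category into itself plus its broader parent categories. It assumes categories are written as slash-delimited paths in the style of the Topics API taxonomy; the helper name `category_ancestors` and the sample path are hypothetical and are not taken from the table or from `GET_HOST_CATEGORIES`.

```python
def category_ancestors(path: str) -> list[str]:
    """Return a category path followed by each of its parent paths.

    Assumes a slash-delimited path such as "/Parent/Child/Grandchild"
    (an assumption about the storage format, not confirmed by the docs).
    The result is ordered from most specific to broadest.
    """
    # Drop empty segments produced by the leading slash.
    parts = [segment for segment in path.split("/") if segment]
    # Rebuild progressively shorter prefixes of the path.
    return ["/" + "/".join(parts[:depth]) for depth in range(len(parts), 0, -1)]


if __name__ == "__main__":
    # Illustrative path only; check the taxonomy for the real category names.
    for category in category_ancestors("/Arts & Entertainment/Music & Audio/Jazz"):
        print(category)
```

Grouping by the last element of the returned list gives a top-level rollup, while the first element keeps the most specific subcategory.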

Please consider the limitations of our method regarding some hosts discussed [here](https://github.com/HTTPArchive/httparchive.org/issues/868). Thus, while this data can be accessed directly, we highly recommend using the `GET_HOST_CATEGORIES` function due to the handling of hashed subdomains. If your analysis requires working with domains (e.g., `google.com` instead of `maps.google.com`), accessing the raw data directly is also appropriate.
Owner

We also list some of these hosts in the Limitations section below, so do we need to link to the GH issue?

Contributor Author

I think linking to the issue makes sense for anyone interested in exploring the limitations further.



## Methodology
The classification of hostnames is performed using Chrome's Topics API (currently `chrome5`) by [Nurullah Demir](https://ndemir.com) and [Yohan Beugin](https://yohan.beugin.org/) using [this repository](https://github.com/yohhaan/httparchive-topics-classification). The taxonomy used is version 2, which consists of 469 categories as defined in the [Topics API taxonomy v2](https://github.com/patcg-individual-drafts/topics/blob/a116e9e404dd96c7793fe6542c53e9a9d93cb75b/taxonomy_v2.md).
Owner

Could we include more info about how recently the chrome5 model was trained?

Contributor Author

Yes, it makes sense. It's currently the latest model shipped in Chrome for topics.

nrllh and others added 6 commits July 4, 2024 12:59
Co-authored-by: Rick Viscomi <rviscomi@users.noreply.github.com>
Co-authored-by: Rick Viscomi <rviscomi@users.noreply.github.com>
Co-authored-by: Rick Viscomi <rviscomi@users.noreply.github.com>
Co-authored-by: Rick Viscomi <rviscomi@users.noreply.github.com>
Co-authored-by: Rick Viscomi <rviscomi@users.noreply.github.com>
Co-authored-by: Rick Viscomi <rviscomi@users.noreply.github.com>
@rviscomi
Owner

rviscomi commented Jul 8, 2024

Thanks, I'll merge and open follow-up PRs for any updates.

@rviscomi rviscomi merged commit a80af55 into rviscomi:main Jul 8, 2024