text-preprocessing-resumes

Description

This notebook preprocesses a set of resume files and generates a sparse representation of the resumes. The dataset contains 250 resumes, which includes information of the applicants like contact details, skills, work experience, educational background, hobbies, etc.

The resumes containting the information of the applicants were loaded and any duplicate resumes were excluded from the analysis. Text pre-processing was performed on the resume files in the dataset. The pre-processing included case normalization, tokenization, removal of stopwords and stemming. Additionally, the most frequent and rare tokens were removed, the tokens with length less than 3 were removed, and the first 200 meaningful bigrams were added to the vocabulary. The capital tokens appearing in the middle of a sentence/line were not normalized to lower case because of the hypothesis that these tokens are likely to have a different meaning than their lower case counterparts.

The notebook concludes with the generation of the lexical vocabulary and the count vector sparse representations of the resumes. The vocabulary and vector representations are written to files, which can be used as input to various recommender-systems and information retrieval algorithms.

Packages required

nltk (https://pypi.org/project/nltk/)
numpy (https://pypi.org/project/numpy/)

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
Resumes		Resumes
README.md		README.md
pre-processing-resumes.ipynb		pre-processing-resumes.ipynb
stopwords_en.txt		stopwords_en.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

text-preprocessing-resumes

Description

Packages required

About

Languages

saahilbhatia/text-preprocessing-resumes

Folders and files

Latest commit

History

Repository files navigation

text-preprocessing-resumes

Description

Packages required

About

Topics

Resources

Stars

Watchers

Forks

Languages