Create words_alpha_clean.txt #108

Open
wants to merge 3 commits into master
Conversation


@Orivoir Orivoir commented Oct 2, 2021

Add file `words_alpha_clean.txt`, a copy of `words_alpha.txt` with the words that do not exist in English removed.
The filtering was done with the API of [wordsapi](https://www.wordsapi.com/), which allows looking up English words. From a script I called the API for each word, and when a word does not exist I removed it from the file.
You can find the API docs [here](https://www.wordsapi.com/docs/).
The exact filter for a word is based on the `frequency` data of the API:

```javascript
if (!!response.word && typeof response.frequency == "object") {
    if (response.frequency.perMillion >= 15) {
        // here the word is kept
        realWords.push(response.word);
    }
    // else the word is removed
}
```

The documentation indicates the text below for the [frequency](https://www.wordsapi.com/docs/#frequency) data:

> This is the number of times the word is likely to appear in any English corpus, per million words.
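The JavaScript filter above translates directly to Python. A minimal sketch of the same predicate — the response shapes here are assumptions based on the WordsAPI docs, not verified output:

```python
def keep_word(response, threshold=15.0):
    """Mirror of the JavaScript filter: keep a word only when the API
    response has a word and a frequency object with a high-enough
    perMillion value."""
    if not response.get("word") or not isinstance(response.get("frequency"), dict):
        return False  # no frequency data at all -> treat as not a real word
    return response["frequency"].get("perMillion", 0) >= threshold

# Response shapes assumed from the WordsAPI docs:
print(keep_word({"word": "house", "frequency": {"perMillion": 512.7}}))  # True
print(keep_word({"word": "zzzz", "frequency": {"perMillion": 0.01}}))    # False
```

Keeping the threshold as a parameter makes it easy to experiment with looser or stricter cutoffs, which is what the discussion below is about.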

jcnmsg commented Nov 9, 2021

Nice work, but from 350000+ lines only around 2500 survived? Seems like the parameters used were a little too strict...


Orivoir commented Nov 10, 2021

I used a less strict filter on the frequency data of the same API, which gives ~30 000 words, but I think some of them are still not real English words. See 4971374


jcnmsg commented Nov 11, 2021

~30 000 would be closer to reality, but it appears to have duplicated a bunch of words as well, which were not duplicated in the original words_alpha.txt. See bedrock, bedroll, bedroom, bedspread, bedstead as examples...
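Duplicates like these can be stripped while keeping first-occurrence order — a minimal sketch, independent of how the list was produced:

```python
def dedupe_preserving_order(words):
    """Drop repeated entries while keeping first-occurrence order
    (dict keys preserve insertion order in Python 3.7+)."""
    return list(dict.fromkeys(words))

# A duplicated run like the one reported above:
sample = ["bedrock", "bedroll", "bedroll", "bedroom", "bedrock"]
print(dedupe_preserving_order(sample))  # ['bedrock', 'bedroll', 'bedroom']
```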

@Timokasse

> Nice work but from 350000+ lines only around 2500 survived? Seems like the parameters used have been a little too strict...

The API is free for 2500 words per day. That is probably why....


jcnmsg commented Dec 24, 2021

> The API is free for 2500 words per day. That is probably why....

@Orivoir did get ~30 000 words just by using different parameters, so that was probably not it.


aploium commented Dec 29, 2021

Maybe it removes too many words. For example: blacklist is in it, but whitelist isn't;
sale is not in, but sales is.

white lives matter, too [:joke:]


ghost commented Jan 8, 2022

Hi all, I have run words_alpha.txt through the "nltk" Python library. The total is 210693 words. This seems to be a bit better, but I have noticed there are still a few oddities in there (maybe things like common abbreviations remain, which aren't actual words). But overall I think this has cleaned out the non-English words.

words_alpha_clean.txt

@silverwings15

@SDidge appreciate the share!


jcnmsg commented Jan 9, 2022

@SDidge At first glance I can't seem to find any non-English words in the file, so I'd say this one is the cleanest file so far, nice work!

@Timokasse

> Hi all, I have run the words_alpha.txt through the "nltk" python library. Total words are 210693. This seems to be a bit better, but I have noticed there are still a few oddities in there (maybe things like common abbreviations remain, which aren't actual words). But overall I think this has cleaned out any non-english words.
>
> words_alpha_clean.txt

@SDidge , what exactly did you use from the NLTK library to check the list of words?


ghost commented Jan 10, 2022

@Timokasse , I just checked if the word existed in the "words" corpus.

E.g.

```python
from nltk.corpus import words

# words_alpha holds the lines of words_alpha.txt;
# a set makes the membership checks fast.
english = set(words.words())
clean = [word for word in words_alpha if word in english]
```

Something like this
