Toxic text big data visualisations

Big data visualisations - multi-label classification of 6 types of toxicity: ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'].
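
In a multi-label setup each comment can carry several labels at once; conceptually, the classifier outputs one independent probability per label, as in this illustrative snippet (not the repository's exact classification head):

import torch

# One raw score (logit) per toxicity type -- illustrative values only.
logits = torch.tensor([2.1, -1.3, 0.4, -3.0, 1.2, -2.2])
# Sigmoid gives each label its own independent probability,
# unlike softmax, which would force the labels to compete.
probs = torch.sigmoid(logits)
labels = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
print({label: round(float(p), 2) for label, p in zip(labels, probs)})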

Based on the Toxic Comment Classifier, which includes a pretrained BERT model for the Kaggle Toxic Comment Classification Challenge.

The pretrained BERT PyTorch model was run on two datasets. The outputs of its last hidden layer were extracted and fed as training data into two dimensionality-reduction models: PCA and UMAP. Next, we visualised all of the reduced data to check how well the model grouped similar toxic sentences together. The results were then converted to JSON and saved, together with the trained models, as pickle files. Finally, all of the files were loaded into a Docker container and wrapped with an easy-to-access REST API that allows querying the data and classifying new toxic phrases with the pretrained models.
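
For reference, the core of that pipeline fits in a few lines. The sketch below is illustrative only: it assumes the Hugging Face transformers, scikit-learn and umap-learn packages, uses bert-base-uncased as a stand-in for the repository's fine-tuned checkpoint, and the sentences and file names are placeholders.

import json
import pickle

import torch
import umap
from sklearn.decomposition import PCA
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["example phrase %d" % i for i in range(100)]  # placeholder data

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    # Last hidden layer: (batch, seq_len, hidden); mean-pooled over tokens here,
    # though the notebook may pool differently (e.g. via the [CLS] token).
    embeddings = model(**batch).last_hidden_state.mean(dim=1).numpy()

# Train the two dimensionality-reduction models on the extracted embeddings.
pca = PCA(n_components=2).fit(embeddings)
reducer = umap.UMAP(n_components=2).fit(embeddings)

# Convert the reduced points to JSON and save the trained models as pickles.
points = {"pca": pca.transform(embeddings).tolist(),
          "umap": reducer.transform(embeddings).tolist()}
with open("points.json", "w") as f:
    json.dump(points, f)
for name, obj in [("pca.pkl", pca), ("umap.pkl", reducer)]:
    with open(name, "wb") as f:
        pickle.dump(obj, f)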

The full process of data extraction, pretrained-model usage, PCA/UMAP training and result visualisation can be found in the Jupyter notebook.

Screenshots

[Screenshots: the API swagger page and three application views (APP1, APP2, APP3)]

Used technologies

  1. PyTorch - pretrained model and all ML-related operations
  2. Flask-RESTX - backend API exposing REST operations on the created models/datasets (see the sketch after this list)
  3. Plotly.js - visualising the datasets in the browser
  4. Vue - the SPA frontend
  5. Docker - easy-to-run-anywhere containers
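
To give a feel for the backend, here is a minimal Flask-RESTX sketch. The route and payload fields are hypothetical illustrations, not the repository's actual API; the real routes are documented in the swagger page mentioned below.

from flask import Flask
from flask_restx import Api, Resource, fields

app = Flask(__name__)
api = Api(app, doc="/api/")  # serves the swagger UI under /api/

# Hypothetical request schema; the real payload may differ.
phrase = api.model("Phrase", {"text": fields.String(required=True)})

@api.route("/classify")
class Classify(Resource):
    @api.expect(phrase, validate=True)
    def post(self):
        text = api.payload["text"]
        # In the real backend, the pickled PCA/UMAP models and the
        # pretrained BERT classifier would be applied to `text` here.
        return {"text": text, "labels": []}

if __name__ == "__main__":
    app.run(port=8081)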

Below you can find the details on how to run and access the application.

Docker INFO

All of the loaded models/datasets are memory hungry, so be sure to give Docker around 13 GB of RAM if you plan to load all of the datasets.

How to start the project

DEV MODE

docker-compose up

Access the frontend on http://localhost:8080
Access the backend (swagger) on http://localhost:8081/api/
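
Once the containers are up, the API can be queried directly. The snippet below is only an illustration: the endpoint path and payload shape are assumptions, so check the swagger page for the real routes.

import requests

# Hypothetical endpoint; see the swagger page for the actual API.
resp = requests.post(
    "http://localhost:8081/api/classify",
    json={"text": "an example phrase to classify"},
)
print(resp.json())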

For easier backend development, when rebuilding the Docker image often, it is wise to use the pip cache, since the Python modules are large. Do the following:

Change the pip install line in the Dockerfile to:

RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
docker-compose up

If the command fails, it might be necessary to:

  1. Add the line # syntax = docker/dockerfile:experimental as the first line of the Dockerfile
  2. Run export DOCKER_BUILDKIT=1 before issuing docker-compose up

PROD MODE

docker-compose -f docker-compose.prod.yml up

Optimised for deployment: uses nginx as a reverse proxy and gunicorn to run the API, and does not require Docker volumes.

Access the frontend on http://localhost:8080
Access the backend from Vue on http://localhost:8080/api (swagger not exposed)
