This project is part of the COMP472 course on Artificial Intelligence, focusing on experimenting with different machine learning algorithms and datasets. The primary objective is to gain practical experience with text classification and drug classification tasks using various classifiers.
- Duration: Fall 2021 semester (Deadline: October 18, 2021)
- Resources: Python 3.8, scikit-learn library, matplotlib, pandas
- Key Activities: Data preprocessing, model training, performance evaluation, and result analysis
- Dataset: BBC news articles (2225 documents, 5 classes)
- Classifier: Multinomial Naive Bayes
- Key Steps:
  - Load and visualize dataset distribution
  - Preprocess text data
  - Split dataset (80% training, 20% testing)
  - Train and evaluate models with different smoothing values
  - Analyze and report various metrics
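
The text-classification steps above can be sketched roughly as follows. This is a minimal illustration, not the code in text-classification.py: the toy corpus, the `evaluate_smoothing` helper, and the three smoothing values are assumptions chosen for the example, and the real project loads the BBC articles instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the BBC articles (two classes only).
sport = ["team wins the match", "striker scores late goal",
         "coach praises the team", "goal decides the cup final",
         "match ends in a draw"] * 2
tech = ["new phone software released", "chip maker unveils processor",
        "software update fixes bug", "phone maker launches tablet",
        "processor speeds up laptop"] * 2
docs = sport + tech
labels = ["sport"] * len(sport) + ["tech"] * len(tech)


def evaluate_smoothing(docs, labels, alphas=(1.0, 0.5, 0.01), seed=0):
    """Train one MultinomialNB per smoothing value and score each on a held-out 20%."""
    # Bag-of-words counts; the vocabulary is built on the whole corpus before splitting.
    X = CountVectorizer().fit_transform(docs)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    results = {}
    for alpha in alphas:
        model = MultinomialNB(alpha=alpha).fit(X_train, y_train)
        pred = model.predict(X_test)
        results[alpha] = {
            "accuracy": accuracy_score(y_test, pred),
            "macro_f1": f1_score(y_test, pred, average="macro", zero_division=0),
        }
    return results


results = evaluate_smoothing(docs, labels)
for alpha, scores in results.items():
    print(alpha, scores)
```

Fixing `random_state` keeps the split identical across smoothing values, so any difference in the scores comes from `alpha` alone.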
- Dataset: Drug dataset (categorical and numerical features)
- Classifiers:
  - Gaussian Naive Bayes
  - Decision Tree (Base and Optimized)
  - Perceptron
  - Multi-Layer Perceptron (Base and Optimized)
- Key Steps:
  - Load and preprocess data
  - Visualize class distribution
  - Train and evaluate multiple classifiers
  - Perform grid search for hyperparameter tuning
  - Analyze and compare model performances
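
The preprocessing and grid-search steps above might look something like the sketch below. It is an illustration under stated assumptions rather than the contents of drug-classification.py: the column names, the synthetic rows, the labelling rule, and the parameter grid are all hypothetical.

```python
from itertools import product

import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic rows; column names and the labelling rule are assumptions.
rows = []
combos = product(["F", "M"], ["LOW", "NORMAL", "HIGH"], [8.0, 12.0, 14.0, 18.0, 22.0, 26.0])
for i, (sex, bp, na_to_k) in enumerate(list(combos) * 2):
    if na_to_k > 15:
        drug = "drugY"
    elif bp == "HIGH":
        drug = "drugA"
    else:
        drug = "drugX"
    rows.append({"Age": 20 + (i * 7) % 50, "Sex": sex, "BP": bp,
                 "Na_to_K": na_to_k, "Drug": drug})
df = pd.DataFrame(rows)

# Nominal feature -> one-hot columns; ordinal feature -> ordered integer codes.
X = pd.get_dummies(df[["Age", "Sex", "BP", "Na_to_K"]], columns=["Sex"])
X["BP"] = pd.Categorical(df["BP"], categories=["LOW", "NORMAL", "HIGH"], ordered=True).codes
y = df["Drug"]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Illustrative grid only; the values tuned in the project may differ.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [2, 4, None],
    "min_samples_split": [2, 4],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

`GridSearchCV` refits the best parameter combination on the full training split by default, so `search.score` evaluates that refitted tree on the held-out rows.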
- Implemented Multinomial Naive Bayes with different smoothing values
- Achieved high accuracy (98.2%) and F1-scores
- Analyzed word frequencies, zero-frequency words, and log probabilities
- Implemented and compared 6 different classifiers
- Decision Tree models showed the best performance (100% accuracy)
- MLP and Gaussian NB showed moderate performance
- Perceptron showed the lowest performance
- Handling imbalanced datasets
- Implementing various classifiers and understanding their parameters
- Both challenges were addressed through careful data preprocessing and parameter tuning
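
To illustrate why imbalanced classes complicate evaluation, the small example below compares accuracy with macro- and weighted-average F1 for a classifier that always predicts the majority class. The labels are made up for the illustration and are not project results.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 8 samples of class "A", 2 of class "B".
y_true = ["A"] * 8 + ["B"] * 2
# A degenerate classifier that always predicts the majority class.
y_pred = ["A"] * 10

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={accuracy:.3f} macro_f1={macro_f1:.3f} weighted_f1={weighted_f1:.3f}")
# prints: accuracy=0.800 macro_f1=0.444 weighted_f1=0.711
```

Accuracy looks acceptable here even though class "B" is never predicted; macro-average F1 weights both classes equally and exposes the failure.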
- Importance of data preprocessing in machine learning tasks
- Impact of smoothing values on Naive Bayes performance
- Effectiveness of decision trees for categorical data
- Significance of hyperparameter tuning for model optimization
This project provided hands-on experience with real-world machine learning tasks, emphasizing the importance of proper data handling, model selection, and performance evaluation in AI applications.
- COMP472_A1_Instruction.pdf: Assignment instructions distributed by the COMP472 professor
- text-classification.py: Implementation of BBC news classification
- drug-classification.py: Implementation of drug classification
- bbc-performance.txt: Performance metrics for text classification
- drugs-performance.txt: Performance metrics for drug classification
- BBC-distribution.pdf: Visualization of BBC dataset distribution
- drug-distribution.pdf: Visualization of drug dataset distribution
- bbc-discussion.txt: Analysis of text classification results
- drugs-discussion.txt: Analysis of drug classification results
- COMP472_A1_Presentation.pdf: Detailed presentation of project results and analysis
- requirements.txt: List of required Python packages
The COMP472_A1_Presentation.pdf file contains a comprehensive overview of the project results and analysis. It includes:
- Visualizations of dataset distributions
- Detailed results for both text and drug classification tasks
- Performance comparisons of different classifiers
- Analysis of model behaviors with different parameters
- Key findings and insights from the experiments

This presentation serves as a valuable resource for understanding the project outcomes in depth and provides visual representations of the results.
- Ensure Python 3.8 and required packages are installed:
pip install -r requirements.txt
- For text classification:
python text-classification.py
- For drug classification:
python drug-classification.py
This project is for educational purposes as part of the COMP472 course. The BBC dataset and drug dataset are provided for academic use only.