Amazon-ML-Challenge-2k23

This repository provides insight into the methods used by Team PowerPuff Girls in the Amazon Machine Learning Challenge 2023.

Note: This is not a comprehensive solution, but an assortment of the key model code that our team implemented.

Team Name: PowerPuff Girls

Team members:

  • Akarshan Kapoor
  • Samvaidan Salgotra
  • Taraksh Sambhar
  • Ayush Tiwari

Leaderboard Position: Rank 50.
Find the leaderboard here.

Explanation of Approaches

Approach 1

This approach constructs a Keras Sequential model, reads in the training data, and processes two text features named "TITLE_DES" and "TITLE_BUL" with a TF-IDF vectoriser. The resulting vectors are padded to a uniform length before the model is trained. At inference time, the script iterates through the groups in the test DataFrame, processes the same two features, builds the corresponding model input, and uses the trained model to make predictions. Finally, all predictions are collected along with their corresponding "PRODUCT_ID" into a DataFrame for the final output.
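
A minimal sketch of that pipeline follows, using scikit-learn for the TF-IDF step and a small dense regression head. The file paths, layer sizes, and the PRODUCT_LENGTH target are assumptions; only TITLE_DES, TITLE_BUL, and PRODUCT_ID come from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow import keras

train = pd.read_csv("train.csv")  # assumed path

# Vectorise the two text features with TF-IDF.
vec_des = TfidfVectorizer(max_features=5000)
vec_bul = TfidfVectorizer(max_features=5000)
X_des = vec_des.fit_transform(train["TITLE_DES"].fillna("")).toarray()
X_bul = vec_bul.fit_transform(train["TITLE_BUL"].fillna("")).toarray()

# Pad both matrices to a common width, then stack them side by side.
width = max(X_des.shape[1], X_bul.shape[1])
pad = lambda m: np.pad(m, ((0, 0), (0, width - m.shape[1])))
X = np.hstack([pad(X_des), pad(X_bul)])
y = train["PRODUCT_LENGTH"].values  # assumed regression target

model = keras.Sequential([
    keras.layers.Input(shape=(X.shape[1],)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64)

# Transform (not refit) the test features with the same vectorisers,
# then collect predictions keyed by PRODUCT_ID.
test = pd.read_csv("test.csv")  # assumed path
X_test = np.hstack([pad(vec_des.transform(test["TITLE_DES"].fillna("")).toarray()),
                    pad(vec_bul.transform(test["TITLE_BUL"].fillna("")).toarray())])
out = pd.DataFrame({"PRODUCT_ID": test["PRODUCT_ID"],
                    "PRODUCT_LENGTH": model.predict(X_test).ravel()})
```

Padding both TF-IDF blocks to the wider vocabulary keeps the concatenated input width fixed between training and inference.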

Approach 2

This approach preprocesses the data by cleaning the text columns: removing HTML tags, lowercasing, stripping punctuation, and eliminating stopwords. The cleaned training data is saved to a new CSV file. The AutoKeras library is then used to create a text regression model, which is trained on the preprocessed training data and reloaded with TensorFlow for inference. The preprocessed test columns are combined into a single text column, and the model predicts "PRODUCT_LENGTH" from both the combined text and the title text of the test data.
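
A sketch of the cleaning and AutoKeras steps under stated assumptions: NLTK supplies the stopword list, and the file paths, column names TITLE and DESCRIPTION, trial budget, and epoch count are all illustrative; PRODUCT_LENGTH is the target named above.

```python
import re
import string
import numpy as np
import pandas as pd
import autokeras as ak
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

STOP = set(stopwords.words("english"))

def clean(text) -> str:
    """Strip HTML tags, lowercase, drop punctuation and stopwords."""
    text = re.sub(r"<[^>]+>", " ", str(text))
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(w for w in text.split() if w not in STOP)

train = pd.read_csv("train.csv")  # assumed path
for col in ("TITLE", "DESCRIPTION"):  # assumed text columns
    train[col] = train[col].fillna("").map(clean)
train.to_csv("train_clean.csv", index=False)  # save the cleaned data

# AutoKeras searches over text-regression architectures automatically.
text = np.array(train["TITLE"] + " " + train["DESCRIPTION"])
reg = ak.TextRegressor(max_trials=3, overwrite=True)
reg.fit(text, np.array(train["PRODUCT_LENGTH"]), epochs=2)

# The best model can be exported as a plain Keras model and later reloaded
# with tf.keras.models.load_model(path, custom_objects=ak.CUSTOM_OBJECTS).
best = reg.export_model()
```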

Approach 3

This approach builds a BERT-based classifier. It begins by fixing a random seed for reproducibility, then handles preprocessing tasks such as duplicate entries and missing values. The titles are encoded into numerical values with a pre-trained multilingual version of BERT, and the encoded data is split into training and validation sets. A PyTorch Dataset class is defined and used to construct DataLoader instances for efficient iteration over the data during training and validation. The script builds the classification model by adding a dropout and a linear layer on top of the pre-trained BERT model. After setting up the learning rate scheduler and loss function, training runs in a loop: in each epoch the model is trained on the full training set, then evaluated on the validation set, and the best-performing checkpoint is saved. Lastly, the model is evaluated using root mean square error (RMSE) as the metric.
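
A condensed sketch of the dataset and model plumbing, assuming the Hugging Face transformers library and the bert-base-multilingual-cased checkpoint; the sequence length, dropout rate, learning rate, and output size are illustrative.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from transformers import BertModel, BertTokenizer, get_linear_schedule_with_warmup

torch.manual_seed(42)  # fixed seed for reproducibility

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")

class TitleDataset(Dataset):
    """Encodes product titles on the fly for the DataLoader."""
    def __init__(self, titles, targets, max_len=64):
        self.titles, self.targets, self.max_len = titles, targets, max_len

    def __len__(self):
        return len(self.titles)

    def __getitem__(self, idx):
        enc = tokenizer(self.titles[idx], truncation=True, padding="max_length",
                        max_length=self.max_len, return_tensors="pt")
        return {"input_ids": enc["input_ids"].squeeze(0),
                "attention_mask": enc["attention_mask"].squeeze(0),
                "target": torch.tensor(self.targets[idx], dtype=torch.float)}

class BertTitleModel(nn.Module):
    """Pre-trained BERT with a dropout and a linear layer on the pooled output.
    n_out=1 gives a single-value head; set it to the number of length buckets
    if the head is meant to classify."""
    def __init__(self, n_out=1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-multilingual-cased")
        self.drop = nn.Dropout(0.3)
        self.head = nn.Linear(self.bert.config.hidden_size, n_out)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(self.drop(out.pooler_output)).squeeze(-1)

# Typical setup: AdamW plus a linear warmup/decay schedule, e.g.
# sched = get_linear_schedule_with_warmup(opt, 0, len(loader) * n_epochs).
device = "cuda" if torch.cuda.is_available() else "cpu"
model = BertTitleModel().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.MSELoss()  # validation RMSE is the square root of this

def train_one_epoch(model, loader, sched):
    model.train()
    for batch in loader:
        opt.zero_grad()
        pred = model(batch["input_ids"].to(device),
                     batch["attention_mask"].to(device))
        loss = loss_fn(pred, batch["target"].to(device))
        loss.backward()
        opt.step()
        sched.step()
```

The best checkpoint (lowest validation RMSE) would be kept with torch.save(model.state_dict(), ...) at the end of each epoch.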
