Entity Resolution

A logistic regression binary classifier trained on camera descriptions. It predicts whether two items are similar. New relationships are then induced from the known ones, so that similar items form cliques. The data sets are vectorized with TF-IDF.

For more information read the project's report.
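The way induced relationships lead to cliques can be sketched with a union-find structure: if A matches B and B matches C, all three end up in the same group. This is only an illustration under assumptions. The project's cliqueGroup uses a hash table (see the -b flag), whereas this sketch swaps in a plain array indexed by item number, and every name in it (clique_init, clique_find, clique_union, same_clique) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Start with every item alone in its own clique. */
void clique_init(size_t *parent, size_t n)
{
    assert(parent != NULL);
    for (size_t i = 0; i < n; i++)
        parent[i] = i;
}

/* Find the representative of x's clique, shortening the path as we go. */
size_t clique_find(size_t *parent, size_t x)
{
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];  /* path halving */
        x = parent[x];
    }
    return x;
}

/* Record that items a and b are similar by merging their cliques. */
void clique_union(size_t *parent, size_t a, size_t b)
{
    size_t ra = clique_find(parent, a);
    size_t rb = clique_find(parent, b);
    if (ra != rb)
        parent[rb] = ra;
}

/* Two items belong to the same clique iff they share a representative. */
int same_clique(size_t *parent, size_t a, size_t b)
{
    return clique_find(parent, a) == clique_find(parent, b);
}
```

After merging the pairs (0, 1) and (1, 2), same_clique reports items 0 and 2 as related even though that pair was never given explicitly.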

How to compile / Makefile usage:

When run through make, the arguments used are those defined in the Makefile located in programs/main.

make (compiles everything; the programs can then be run from programs/main)
make run (runs all the tests and the programs)
make run-programs (runs only the programs)
make run-tests (runs only the tests)
make valgrind (runs all the tests and the programs with valgrind)
make valgrind-tests (runs only the tests with valgrind)
make valgrind-programs (runs only the programs with valgrind)
make clean (deletes everything made by the Makefiles)

Argument flags:

-f : the folder that contains the folders with the JSON files
-b : size of the HashTable array used in cliqueGroup (optional)
-w : path to the datasetW CSV file
-i : where the produced similar pairs should be saved
-n : where the produced non-similar (non-identical) pairs should be saved
-o : where the predictions and accuracy on the testing set should be saved
-v : vocabulary size for the dictionary
-e : number of epochs for training the model
-d : max accuracy difference, > 0 and <= 1 (during testing, a prediction whose difference is below this value is acceptable and counts as accurate)
-r : learning rate for model training (double)
-thrd : number of threads
-bs : batch size
-train : training steps (1 means no retraining, 2 means 1 retraining step, etc.)
-eq : discard extra pairs so that the numbers of identical and non-identical pairs are ALWAYS equal before training
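One possible reading of the -d flag is sketched below: a prediction counts as accurate when its absolute difference from the true label is below the given maximum. This interpretation, the function name accuracy_within, and the array layout are assumptions for illustration; the project's actual check may differ.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: fraction of predictions whose distance from the
 * true label (0 or 1) is strictly below max_diff, with 0 < max_diff <= 1. */
double accuracy_within(const double *pred, const int *label,
                       size_t n, double max_diff)
{
    assert(n > 0 && max_diff > 0.0 && max_diff <= 1.0);
    size_t correct = 0;
    for (size_t i = 0; i < n; i++) {
        if (fabs(pred[i] - (double)label[i]) < max_diff)
            correct++;
    }
    return (double)correct / (double)n;
}
```

Under this reading, a smaller value makes the test stricter, since predictions must lie closer to the label to count as accurate.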

External libraries:

  • acutest.h, used for unit testing; fetch it with the command "git submodule update"