Database implementation #92

jaimehisao · 2020-08-28T02:03:56Z

Started implementing the database, without unit tests so far. It probably won't get merged as is, but here is the code.

* Create tfidf method for search engine
* Pylint fixes
* Add testing
* Fix linting
* Add numpy to requirements.txt
* Adapt to articles/keyword object return
* Change naming to agreed db convention
* disable numpy import error

Co-authored-by: Domene99 <jadomene99@gmail.com>

Domene99

Overall pretty impressive. I can't pinpoint where the code is becoming so slow; going from 1 to 10 minutes is more than just communication delay.

services/parser-database/app.py

services/parser-database/connector.py

Domene99 · 2020-08-28T20:33:34Z

services/parser-database/connector.py

+    if doc is not None:
+        return doc.to_dict()
+    else:
+        return None


Maybe we should return some info for debugging purposes.

Something like your last PR, where you added more error handling?
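
For illustration, a minimal sketch of what returning or logging some debugging info could look like. It assumes the Firestore client is passed in as `db` and that articles live in a collection called `articles`; both are assumptions, not taken from the PR. It relies on `DocumentSnapshot.exists` to tell a missing document apart from a found one.

```python
import logging

logger = logging.getLogger("connector")


def get_article_by_id_db(db, art_id):
    """Return the article as a dict, or None if it does not exist.

    `db` is assumed to be a Firestore client and "articles" an assumed
    collection name. A warning is logged when the lookup misses, so the
    caller's logs show which ID was requested.
    """
    snapshot = db.collection("articles").document(art_id).get()
    if not snapshot.exists:
        logger.warning("Article %r not found in collection 'articles'", art_id)
        return None
    return snapshot.to_dict()
```

The caller still receives `None` on a miss, so existing call sites keep working; the only change is that the miss leaves a trace in the logs.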


jaimehisao · 2020-08-28T20:40:42Z

> Overall pretty impressive, I can't pinpoint where the code is becoming so slow, from 1 to 10 mins is more than just communication delay

I don't know either, but it is probably not on my end, since it is the same code as the in-memory implementation.

jaimehisao · 2020-08-28T20:43:40Z

Note that the PR is still a draft, so there may be a lot of leftover prints and comments; those are being addressed.

ssmall

Some preliminary feedback to guide your implementation, hope it helps!

ssmall · 2020-08-28T20:48:15Z

services/parser-database/app.py

+from connector import get_article_by_id_db
+from connector import get_articles_that_match_keywords_db


As @anniefu mentioned, you'll probably want to just have a db_connector.py that defines methods with the same names that you already use in the code, so that you can just change the import statement and not have to change any code.
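
A rough sketch of that idea, under a few assumptions: the in-memory function is assumed to be named `get_article_by_id` (without the `_db` suffix), the collection name `articles` is made up, and `init` is a hypothetical helper for handing the Firestore client to the module once at startup.

```python
"""db_connector.py (file name as suggested in the comment above)."""

_client = None  # Firestore client, injected once at startup


def init(client):
    """Register the Firestore client the functions below will use."""
    global _client
    _client = client


def get_article_by_id(art_id):
    """Same name and return shape as the assumed in-memory function."""
    snapshot = _client.collection("articles").document(art_id).get()
    return snapshot.to_dict() if snapshot.exists else None


# app.py would then change only its import line, for example:
#   from connector import get_article_by_id      # in-memory (assumed name)
#   from db_connector import get_article_by_id   # Firestore-backed
```

Because the public names match, the rest of `app.py` would not need to know which backend it is talking to.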


ssmall · 2020-08-28T20:57:17Z

services/parser-database/connector.py

+def save_keywords_in_db(keywords, article):
+    """Saves the keywords from an article in memory
+
+    Args:
+        keywords (JSON): contains keywords
+        article (Article): article object
+    """
+    for keyword in keywords:
+        frequency = article["content"].count(keyword)
+
+        doc_ref = db.collection(u'keywords').where('keyword', '==', keyword)
+        doc = doc_ref.get()
+
+        if len(doc) != 0 and doc[0] is not None:
+            from_db = doc[0].to_dict()
+            print(from_db)
+            from_db["matching_articles"][article["id"]] = frequency
+            #print(from_db)
+            db.collection(u'keywords').document(doc[0].id).set(from_db)
+        else:
+            to_send = {"keyword": keyword, "matching_articles": {article["id"]: frequency}}
+            db.collection(u'keywords').add(to_send)


It looks like you're trying to mirror the same structure that you had for the in-memory implementation, which is not necessarily the best implementation. For example, keywords could be a field of the article document that gets queried directly, which eliminates the need to store keywords as a separate collection. See this guide on how to perform different queries against Firestore, particularly the example for array membership.
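
As a sketch of the array-membership approach: the function names, the `db` parameter, and the `articles` collection name below are assumptions for illustration. The query uses Firestore's `array_contains` operator; newer versions of the Python client prefer passing a `filter=` keyword argument to `where`, but the positional form shown here matches the style already used in this PR.

```python
def save_article(db, article, keywords):
    """Store the article with its keywords as an array field.

    `db` is assumed to be a Firestore client; "articles" and the
    "keywords" field name are illustrative choices.
    """
    doc = dict(article, keywords=sorted(set(keywords)))
    db.collection("articles").document(article["id"]).set(doc)


def get_articles_that_match_keyword(db, keyword):
    """Return every article whose keywords array contains `keyword`."""
    query = db.collection("articles").where("keywords", "array_contains", keyword)
    return [snapshot.to_dict() for snapshot in query.stream()]
```

This drops the separate `keywords` collection, so saving an article is a single write instead of one read and one write per keyword. If the per-keyword frequency is still needed, it could be kept in an additional map field on the same document.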

ssmall · 2020-08-28T20:58:19Z

services/parser-database/parser.py

@@ -38,7 +38,7 @@ class Article:
    def __init__(self, number, content):
        self.number = number
        self.content = content
-        self.id = str(number)
+        self.id = 'monterrey'+str(number)


Could this ID be generated by Firebase instead?
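
For reference, a small sketch of letting Firestore generate the ID. Calling `document()` on a collection without an argument returns a reference with an auto-generated ID. The `add_article` name, the `db` parameter, and the `articles` collection are assumptions for illustration.

```python
def add_article(db, number, content):
    """Create an article document and let Firestore pick its ID.

    `db` is assumed to be a Firestore client. The generated ID is
    returned so the caller can store it on the Article object.
    """
    doc_ref = db.collection("articles").document()  # no ID given: auto-generated
    doc_ref.set({"number": number, "content": content})
    return doc_ref.id
```

The article number stays available as a regular field, so it can still be queried without being baked into the document ID.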

jaimehisao and others added 7 commits July 21, 2020 21:54

Added gitignore

f7fc4a9

Fixed merge issues

b5d9819

merge fix

099e126

Merge branch 'main' of github.com:jaimehisao/major-tom into main

bce7644

Merge branch 'main' of github.com:jaimehisao/major-tom into main

fc8a221

Added DB Functionality

5e89889

