Skip to content
This repository has been archived by the owner on May 15, 2023. It is now read-only.

Database implementation #92

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

jaimehisao
Copy link
Contributor

Started implementing the database without unit testing. While probably it won't get merged but here is the code.

jaimehisao and others added 7 commits July 21, 2020 21:54
* Create tfidf method for search engine

* Pylint fixes

* Add testing

* Fix linting

* Add numpy to requirements.txt

* Adapt to articles/keyword object return

* Change naming to agreed db convention

* disable numpy import error

Co-authored-by: Domene99 <jadomene99@gmail.com>
Copy link
Contributor

@Domene99 Domene99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall pretty impressive, I can't pinpoint where the code is becoming so slow, from 1 to 10 mins is more than just communication delay

services/parser-database/app.py Show resolved Hide resolved
services/parser-database/connector.py Show resolved Hide resolved
services/parser-database/connector.py Show resolved Hide resolved
if doc is not None:
return doc.to_dict()
else:
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should return some info for debugging purposes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like your last PR where you added more error handling?

services/parser-database/connector.py Show resolved Hide resolved
services/parser-database/connector.py Show resolved Hide resolved
@jaimehisao
Copy link
Contributor Author

Overall pretty impressive, I can't pinpoint where the code is becoming so slow, from 1 to 10 mins is more than just communication delay

I don´t know either, but it is probably not on my end since it is the same code as the in-memory implementation.

@jaimehisao
Copy link
Contributor Author

Would like to comment that the PR is still in draft form so there might be a lot of prints there and comments too, those are being addressed.

Copy link
Contributor

@ssmall ssmall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some preliminary feedback to guide your implementation, hope it helps!

Comment on lines +10 to +11
from connector import get_article_by_id_db
from connector import get_articles_that_match_keywords_db
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @anniefu mentioned, you'll probably want to just have a db_connector.py that defines methods with the same names that you already use in the code, so that you can just change the import statement and not have to change any code.

services/parser-database/connector.py Show resolved Hide resolved
Comment on lines +163 to +184
def save_keywords_in_db(keywords, article):
"""Saves the keywords from an article in memory

Args:
keywords (JSON): contains keywords
article (Article): article object
"""
for keyword in keywords:
frequency = article["content"].count(keyword)

doc_ref = db.collection(u'keywords').where('keyword', '==', keyword)
doc = doc_ref.get()

if len(doc) != 0 and doc[0] is not None:
from_db = doc[0].to_dict()
print(from_db)
from_db["matching_articles"][article["id"]] = frequency
#print(from_db)
db.collection(u'keywords').document(doc[0].id).set(from_db)
else:
to_send = {"keyword": keyword, "matching_articles": {article["id"]: frequency}}
db.collection(u'keywords').add(to_send)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you're trying to mirror the same structure that you had for the in-memory implementation, which is not necessarily the best implementation. For example, keywords could be a field of the article document that gets queried directly, which eliminates the need to store keywords as a separate collection. See this guide on how to perform different queries against Firestore, particularly the example for array membership.

@@ -38,7 +38,7 @@ class Article:
def __init__(self, number, content):
self.number = number
self.content = content
self.id = str(number)
self.id = 'monterrey'+str(number)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this ID be generated by Firebase instead?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants