Supply Chain Companies Analysis using Scrapy and HuggingFace Transformers
- The spider crawls the URLs from `self.company_urls`, defined in suppy_chain.py
- Company details such as Name, House No., Status, Type, Address and Industry are fetched from find-and-update.company-information
- Company reviews are scraped from Trustpilot
- More details, such as Social accounts, Recent NEWS, Quality, Summary, Sectors, Products and Services of a company, are scraped from its official website. A custom spider is defined for this.
- Each scraped item is then processed in `SupplyChainCrawlerPipeline` and stored in MongoDB
- Before the spider closes, documents are fetched from MongoDB, and HuggingFace Summarization and QnA tasks are applied to the summary.
- The finalized data is updated in MongoDB and a PDF report is generated.
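To illustrate the storage step, here is a minimal sketch of an item pipeline that upserts each scraped item into MongoDB. It is not the actual `SupplyChainCrawlerPipeline`: the class name `MongoPipelineSketch`, the connection URI, the database and collection names, and the `name` key used for matching are all assumptions made for the example.

```python
class MongoPipelineSketch:
    """Hypothetical stand-in for an item pipeline that writes to MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="supply_chain", collection=None):
        # All defaults here are assumed values, not the project's settings.
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.client = None
        # A collection can be injected directly, which is handy for tests.
        self.collection = collection

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        if self.collection is None:
            # Imported lazily so the class can be loaded without pymongo.
            from pymongo import MongoClient
            self.client = MongoClient(self.mongo_uri)
            self.collection = self.client[self.db_name]["companies"]

    def process_item(self, item, spider):
        # Scrapy calls this for every scraped item.
        doc = dict(item)
        # Upsert keyed on the (assumed) "name" field, so re-crawling a
        # company updates its document instead of duplicating it.
        self.collection.update_one(
            {"name": doc["name"]}, {"$set": doc}, upsert=True
        )
        return item

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        if self.client is not None:
            self.client.close()
```

In a Scrapy project, a class like this would be enabled through the `ITEM_PIPELINES` setting.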
Here is the architecture of our process
The following models are used for the Summarization task:
'summary_model': [
'sshleifer/distilbart-cnn-12-6',
'facebook/bart-large-cnn'
]
The following models are used for the QnA task:
'qna_model': [
'deepset/roberta-base-squad2',
'deepset/bert-base-cased-squad2',
'deepset/xlm-roberta-large-squad2',
'ahotrod/electra_large_discriminator_squad2_512'
]
The following are the challenges I overcame during the process:
- HuggingFace QnA models give answers of only a few words
  - To work around this, I summarized the whole data and broke it into several paragraphs
  - I applied existing QnA models to the summarized text and extracted the line where the answer is present
  - Finally, I joined the answers to get a paragraph summarizing all the key points
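The workaround above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the function names are hypothetical, sentences are split with a simple regular expression, and the QnA model is represented by a plain callable `qa(question, context)` that returns the short answer string.

```python
import re
from typing import Callable, List

_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_paragraphs(text: str, max_chars: int = 500) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    chunks, current = [], ""
    for sentence in _SENTENCE_END.split(text.strip()):
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


def sentence_containing(context: str, answer: str) -> str:
    """Return the first sentence of context that contains the answer."""
    for sentence in _SENTENCE_END.split(context):
        if answer and answer in sentence:
            return sentence.strip()
    return ""


def answer_as_paragraph(question: str, paragraphs: List[str],
                        qa: Callable[[str, str], str]) -> str:
    """Ask the question of each paragraph and join the matching lines."""
    lines: List[str] = []
    for paragraph in paragraphs:
        answer = qa(question, paragraph)
        line = sentence_containing(paragraph, answer)
        if line and line not in lines:  # skip empty and repeated lines
            lines.append(line)
    return " ".join(lines)
```

With the `transformers` library installed, `qa` could wrap a `pipeline("question-answering", model=...)` object and return the `"answer"` field of its result.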
- Setting up Pipelines with MongoDB and Transformers
  - Models were taking too much space and time to load, so I used huggingface-inference-api
  - Using `PyMongo` locally was not a good idea in the long run, so I set it up in the cloud using MongoAtlas
  - As a result, time and space usage in the Pipeline were reduced
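As a rough illustration of calling a hosted QnA model instead of loading it locally, the sketch below builds and sends an HTTP request using only the standard library. The endpoint root, the helper names and the token handling are assumptions for this example; the current HuggingFace documentation should be checked for the endpoint that applies today.

```python
import json
import urllib.request

# Assumed endpoint root for hosted inference; verify against current docs.
API_ROOT = "https://api-inference.huggingface.co/models"


def build_qna_request(model_id: str, question: str, context: str,
                      token: str) -> urllib.request.Request:
    """Build (but do not send) a question-answering request."""
    payload = {"inputs": {"question": question, "context": context}}
    return urllib.request.Request(
        f"{API_ROOT}/{model_id}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ask(model_id: str, question: str, context: str, token: str,
        timeout: float = 30.0):
    """Send the request and return the decoded JSON response."""
    request = build_qna_request(model_id, question, context, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because nothing is downloaded or held in memory locally, switching between the models listed earlier only means changing `model_id`.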
- Exporting report in a PDF
  - There is no pre-defined function/library in `FPDF` to export a DataFrame or Dict, so I coded it manually from scratch and it worked!
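One way to approach the manual export is to flatten the record into label/value rows and write each row as a pair of cells. The sketch below is an illustration of that idea, not the code used in this project: `flatten_record` and `write_report` are hypothetical names, and the column width, line height and font are arbitrary choices.

```python
from typing import Any, Dict, List, Tuple


def flatten_record(record: Dict[str, Any],
                   prefix: str = "") -> List[Tuple[str, str]]:
    """Turn a (possibly nested) dict into a flat list of (label, text) rows."""
    rows: List[Tuple[str, str]] = []
    for key, value in record.items():
        label = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested dicts become dotted labels, e.g. "social.twitter".
            rows.extend(flatten_record(value, prefix=f"{label}."))
        elif isinstance(value, (list, tuple)):
            rows.append((label, ", ".join(str(v) for v in value)))
        else:
            rows.append((label, str(value)))
    return rows


def write_report(record: Dict[str, Any], path: str,
                 key_width: float = 50, line_height: float = 8) -> None:
    """Write one row per field: a bold label cell, then a wrapped value."""
    from fpdf import FPDF  # imported lazily; requires the fpdf package

    pdf = FPDF()
    pdf.add_page()
    for label, text in flatten_record(record):
        pdf.set_x(pdf.l_margin)  # start every row at the left margin
        pdf.set_font("Helvetica", style="B", size=10)
        pdf.cell(key_width, line_height, label)
        pdf.set_font("Helvetica", size=10)
        # Width 0 extends to the right margin; long values wrap.
        # Note: the built-in fonts only cover Latin-1 characters.
        pdf.multi_cell(0, line_height, text)
    pdf.output(path)
```

A DataFrame could be handled the same way by converting each row to a dict first, for example with `df.to_dict("records")`.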