company-scraper-transformer

Supply Chain Companies Analysis using Scrapy and HuggingFace Transformers


Process:

  • The spider crawls the URLs from self.company_urls defined in suppy_chain.py (see the sketch after this list)
  • Company details such as Name, House No., Status, Type, Address and Industry are fetched from find-and-update.company-information
  • Company reviews are scraped from Trustpilot
  • Further details such as social accounts, recent news, quality, summary, sectors, and products and services are scraped from each company's official website. A custom spider is defined for this.
  • Each scraped item is then processed in SupplyChainCrawlerPipeline and stored in MongoDB
  • Before the spider closes, documents are fetched from MongoDB, and HuggingFace summarization and QnA tasks are applied to the summaries.
  • The finalized data is updated in MongoDB and a PDF report is generated.
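
Below is a minimal sketch of how the spider and MongoDB pipeline fit together, assuming Scrapy's standard layout; the seed URL, CSS selectors, and connection string are illustrative placeholders, not the repo's actual values:

import scrapy
import pymongo


class SupplyChainSpider(scrapy.Spider):
    name = 'supply_chain'

    # Hypothetical seed list; the real one lives in suppy_chain.py
    company_urls = ['https://example.com/company/12345678']

    def start_requests(self):
        for url in self.company_urls:
            yield scrapy.Request(url, callback=self.parse_company)

    def parse_company(self, response):
        # Selectors are illustrative, not the repo's actual ones
        yield {
            'name': response.css('h1::text').get(),
            'status': response.css('.company-status::text').get(),
        }


class SupplyChainCrawlerPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient('mongodb://localhost:27017')  # or a MongoDB Atlas URI
        self.collection = self.client['supply_chain']['companies']

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()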

Here is the architecture of our process:


HuggingFace Models

The following models are used for the summarization task:

'summary_model': [
         'sshleifer/distilbart-cnn-12-6',
         'facebook/bart-large-cnn'
         ]
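
As a rough illustration (not the repo's exact code), either checkpoint can be loaded through the transformers summarization pipeline; the sample text and generation lengths are arbitrary:

from transformers import pipeline

summarizer = pipeline('summarization', model='sshleifer/distilbart-cnn-12-6')

text = (
    'Acme Ltd is a supply chain company based in London. It manufactures '
    'industrial fasteners, operates three distribution centres across the UK, '
    'and supplies components to automotive and aerospace customers worldwide.'
)
result = summarizer(text, max_length=50, min_length=20, do_sample=False)
print(result[0]['summary_text'])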

The following models are used for the QnA task:

'qna_model': [
         'deepset/roberta-base-squad2',
         'deepset/bert-base-cased-squad2',
         'deepset/xlm-roberta-large-squad2',
         'ahotrod/electra_large_discriminator_squad2_512'
         ]
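
Similarly, a hedged sketch of running one of the QnA checkpoints through the transformers question-answering pipeline; the question and context are made up for the example:

from transformers import pipeline

qna = pipeline('question-answering', model='deepset/roberta-base-squad2')

result = qna(
    question='What does the company produce?',
    context='Acme Ltd manufactures industrial fasteners and supplies them worldwide.',
)
print(result['answer'], result['score'])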

Challenges

The following are the challenges I overcame during the process:

  1. HuggingFace QnA models give answers in only a few words

    • To work around this, I summarized the whole data and broke it into several paragraphs
    • Applied the existing QnA models to the summarized text and extracted the line where each answer is present
    • Finally, joined the answers to get a paragraph summarizing all the key points (a sketch of this follows the list)
  2. Setting up pipelines with MongoDB and Transformers

    • The models took too much space and time to load, so I used the HuggingFace Inference API (also sketched below)
    • Using PyMongo locally was not a good idea in the long run, so I set it up in the cloud using MongoDB Atlas
    • As a result, both time and space usage in the pipeline were reduced
  3. Exporting the report as a PDF

    • There is no pre-defined function in FPDF to export a DataFrame or dict, so I coded it manually from scratch and it worked! (See the final sketch below.)
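
A minimal sketch of the long-answer trick from challenge 1, assuming a transformers question-answering pipeline like the one shown earlier; the naive sentence splitting is my simplification, not necessarily the repo's exact logic:

def long_answer(question, paragraphs, qna):
    # Run the QnA model per paragraph, keep the whole sentence that
    # contains each short answer, and join the sentences into one paragraph.
    lines = []
    for paragraph in paragraphs:
        answer = qna(question=question, context=paragraph)['answer']
        for sentence in paragraph.split('. '):
            if answer in sentence:
                lines.append(sentence.strip().rstrip('.'))
                break
    return '. '.join(lines) + '.'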
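
For challenge 2, a hedged sketch of querying the hosted HuggingFace Inference API instead of loading a model locally; the token and inputs are placeholders:

import requests

API_URL = 'https://api-inference.huggingface.co/models/deepset/roberta-base-squad2'
HEADERS = {'Authorization': 'Bearer hf_xxx'}  # placeholder HuggingFace API token

payload = {
    'inputs': {
        'question': 'Where is the company based?',
        'context': 'Acme Ltd is a supply chain company based in London.',
    }
}
response = requests.post(API_URL, headers=HEADERS, json=payload)
print(response.json())  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'London'}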
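
And for challenge 3, a minimal sketch of writing a dict to a PDF with FPDF; the fields and file name are invented for illustration:

from fpdf import FPDF

def dict_to_pdf(data, path):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font('Arial', size=12)
    for key, value in data.items():
        # multi_cell wraps long values and moves to the next line automatically
        pdf.multi_cell(0, 8, f'{key}: {value}')
    pdf.output(path)

dict_to_pdf({'Name': 'Acme Ltd', 'Status': 'Active', 'Industry': 'Logistics'}, 'report.pdf')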
