company-scraper-transformer

Supply Chain Companies Analysis using Scrapy and HuggingFace Transformers


Process:

  • The spider crawls the URLs from self.company_urls defined in suppy_chain.py (see the sketch after this list)
  • Company details such as Name, House No., Status, Type, Address and Industry are fetched from find-and-update.company-information
  • Company reviews are scraped from Trustpilot
  • Further details such as social accounts, recent news, quality, summary, sectors, and products and services are scraped from each company's official website. A custom spider is defined for this.
  • Each scraped item is then processed in SupplyChainCrawlerPipeline and stored in MongoDB
  • Before the spider closes, documents are fetched from MongoDB, and HuggingFace summarization and QnA tasks are applied to the summaries.
  • The finalized data is updated in MongoDB and a PDF report is generated.
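
Below is a minimal sketch of how the spider and MongoDB pipeline fit together, assuming Scrapy's standard layout; the seed URL, CSS selectors, and connection string are illustrative placeholders, not the repo's actual values:

import scrapy
import pymongo


class SupplyChainSpider(scrapy.Spider):
    name = 'supply_chain'

    # Hypothetical seed list; the real one lives in suppy_chain.py
    company_urls = ['https://example.com/company/12345678']

    def start_requests(self):
        for url in self.company_urls:
            yield scrapy.Request(url, callback=self.parse_company)

    def parse_company(self, response):
        # Selectors are illustrative, not the repo's actual ones
        yield {
            'name': response.css('h1::text').get(),
            'status': response.css('.company-status::text').get(),
        }


class SupplyChainCrawlerPipeline:
    def open_spider(self, spider):
        self.client = pymongo.MongoClient('mongodb://localhost:27017')  # or a MongoDB Atlas URI
        self.collection = self.client['supply_chain']['companies']

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()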

Here is the architecture of our process:


HuggingFace Models

The following models are used for the summarization task:

'summary_model': [
         'sshleifer/distilbart-cnn-12-6',
         'facebook/bart-large-cnn'
         ]
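
As a rough illustration (not the repo's exact code), either checkpoint can be loaded through the transformers summarization pipeline; the sample text and generation lengths are arbitrary:

from transformers import pipeline

summarizer = pipeline('summarization', model='sshleifer/distilbart-cnn-12-6')

text = (
    'Acme Ltd is a supply chain company based in London. It manufactures '
    'industrial fasteners, operates three distribution centres across the UK, '
    'and supplies components to automotive and aerospace customers worldwide.'
)
result = summarizer(text, max_length=50, min_length=20, do_sample=False)
print(result[0]['summary_text'])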

The following models are used for the QnA task:

'qna_model': [
         'deepset/roberta-base-squad2',
         'deepset/bert-base-cased-squad2',
         'deepset/xlm-roberta-large-squad2',
         'ahotrod/electra_large_discriminator_squad2_512'
         ]
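
Similarly, a hedged sketch of running one of the QnA checkpoints through the transformers question-answering pipeline; the question and context are made up for the example:

from transformers import pipeline

qna = pipeline('question-answering', model='deepset/roberta-base-squad2')

result = qna(
    question='What does the company produce?',
    context='Acme Ltd manufactures industrial fasteners and supplies them worldwide.',
)
print(result['answer'], result['score'])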

Challenges

The following are the challenges I overcame during the process:

  1. HuggingFace QnA models give answers in only a few words

    • To work around this, I summarized the whole data and broke it into several paragraphs
    • Applied the existing QnA models to the summarized text and extracted the line where each answer is present
    • Finally, joined the answers to get a paragraph summarizing all the key points (a sketch of this follows the list)
  2. Setting up pipelines with MongoDB and Transformers

    • The models took too much space and time to load, so I used the HuggingFace Inference API (also sketched below)
    • Using PyMongo locally was not a good idea in the long run, so I set it up in the cloud using MongoDB Atlas
    • As a result, both time and space usage in the pipeline were reduced
  3. Exporting the report as a PDF

    • There is no pre-defined function in FPDF to export a DataFrame or dict, so I coded it manually from scratch and it worked! (See the final sketch below.)
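
A minimal sketch of the long-answer trick from challenge 1, assuming a transformers question-answering pipeline like the one shown earlier; the naive sentence splitting is my simplification, not necessarily the repo's exact logic:

def long_answer(question, paragraphs, qna):
    # Run the QnA model per paragraph, keep the whole sentence that
    # contains each short answer, and join the sentences into one paragraph.
    lines = []
    for paragraph in paragraphs:
        answer = qna(question=question, context=paragraph)['answer']
        for sentence in paragraph.split('. '):
            if answer in sentence:
                lines.append(sentence.strip().rstrip('.'))
                break
    return '. '.join(lines) + '.'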
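
For challenge 2, a hedged sketch of querying the hosted HuggingFace Inference API instead of loading a model locally; the token and inputs are placeholders:

import requests

API_URL = 'https://api-inference.huggingface.co/models/deepset/roberta-base-squad2'
HEADERS = {'Authorization': 'Bearer hf_xxx'}  # placeholder HuggingFace API token

payload = {
    'inputs': {
        'question': 'Where is the company based?',
        'context': 'Acme Ltd is a supply chain company based in London.',
    }
}
response = requests.post(API_URL, headers=HEADERS, json=payload)
print(response.json())  # e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'London'}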
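
And for challenge 3, a minimal sketch of writing a dict to a PDF with FPDF; the fields and file name are invented for illustration:

from fpdf import FPDF

def dict_to_pdf(data, path):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font('Arial', size=12)
    for key, value in data.items():
        # multi_cell wraps long values and moves to the next line automatically
        pdf.multi_cell(0, 8, f'{key}: {value}')
    pdf.output(path)

dict_to_pdf({'Name': 'Acme Ltd', 'Status': 'Active', 'Industry': 'Logistics'}, 'report.pdf')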
