
AdritPal08/universal-web-scraper-using-generative-ai


Smart & Universal Web Scraper: Effortless Data Extraction, Powered by Generative AI 🦑

The Smart & Universal Web Scraper is an intelligent data extraction tool powered by Generative AI. It simplifies scraping data from any website: provide the website link and the required data fields, and the tool extracts the data and presents it in a tabular format that can be downloaded as Excel, JSON, or Markdown. Its smart, user-friendly interface ensures efficient and accurate data extraction for all your web scraping needs.
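The tabular output and Markdown download can be illustrated with a small sketch. Note that `rows_to_markdown` is a hypothetical helper written for illustration, not code from this repository:

```python
def rows_to_markdown(fields, rows):
    """Render extracted records (list of dicts) as a Markdown table."""
    header = "| " + " | ".join(fields) + " |"
    divider = "| " + " | ".join("---" for _ in fields) + " |"
    body = ["| " + " | ".join(str(r.get(f, "")) for f in fields) + " |" for r in rows]
    return "\n".join([header, divider] + body)

fields = ["name", "price"]
rows = [{"name": "Widget", "price": "9.99"}, {"name": "Gadget", "price": "19.99"}]
print(rows_to_markdown(fields, rows))
```

The same list of records can be written out as JSON with `json.dump`, or as Excel via a library such as pandas or openpyxl.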

How it Works

  1. Launch the Application: Open the Universal Web Scraper on your system.
  2. Select an LLM Model: Choose the desired Large Language Model (LLM) from the available options.
  3. Input the Website Link: Paste the URL of the website from which you want to scrape data.
  4. Define the Data Fields: Specify the data fields you want to extract from the website.
  5. Automatic Data Extraction: The application intelligently scrapes the data and organizes it into a clear, structured table.
  6. Download the Data: Export the scraped data in your preferred format (Excel, JSON, Markdown).
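The extraction steps above can be sketched roughly as follows. The prompt wording and the `build_extraction_prompt` / `parse_llm_json` helpers are illustrative assumptions, not the repository's actual implementation:

```python
import json

def build_extraction_prompt(fields, page_text):
    # Ask the model to return only a JSON array of objects with the requested keys.
    return (
        "Extract the following fields from the page text below: "
        + ", ".join(fields)
        + ". Respond with a JSON array of objects only.\n\n"
        + page_text
    )

def parse_llm_json(reply):
    """Parse the model's reply into a list of records (a JSON array of objects)."""
    records = json.loads(reply)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of objects")
    return records

# Simulated model reply, since no API call is made in this sketch:
reply = '[{"title": "Example Product", "price": "42.00"}]'
records = parse_llm_json(reply)
```

In the real application, the reply would come from the selected LLM, and `records` would then be rendered as the structured table shown in the UI.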

It leverages the following technologies:

Python: Python is a popular, versatile programming language known for its simplicity and readability. It is widely used for various applications, including web development, data analysis, machine learning, and automation tasks. Python's extensive ecosystem of libraries and frameworks makes it a powerful tool for developers.

LLaMA 3.1 (70B): LLaMA (Large Language Model Meta AI) is a family of large language models developed by Meta AI. The 3.1 (70B) variant has 70 billion parameters. Large language models like LLaMA are trained on vast amounts of text data, allowing them to understand and generate human-like text for a wide range of natural language processing tasks.

Groq API: Groq API provides access to Groq's powerful AI inference platform. It enables developers to leverage their advanced hardware and software for rapid and efficient AI model execution.
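Groq exposes an OpenAI-compatible chat-completions API. The sketch below only builds the request body (no network call is made); the endpoint URL and model name follow Groq's documented conventions, but treat them as assumptions and check Groq's current documentation:

```python
import json

# Groq's OpenAI-compatible chat-completions endpoint (assumption; verify against Groq docs).
GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_groq_payload(prompt, model="llama-3.1-70b-versatile"):
    """Build the JSON body for a Groq chat-completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic output helps structured extraction
    }

body = json.dumps(build_groq_payload("Extract the product names from this page."))
```

The actual request would POST `body` to `GROQ_URL` with an `Authorization: Bearer <GROQ_API_KEY>` header, using `requests`, `urllib`, or Groq's official Python client.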

Streamlit: Streamlit is an open-source Python library that simplifies the process of building interactive data visualization and machine learning web applications. It allows developers to create user interfaces by writing Python scripts, making it easier to share data-driven applications with others.

Running the Project

  1. Fork or Clone the Repository:

Fork or clone this repository to your local machine using Git.

  2. Install Requirements:

Install the necessary libraries:

pip install -r requirements.txt

  3. Set Up Environment Variables:

Create a .env file in your project directory and add the required API keys (e.g., Google API key, Groq API key).
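A .env file for this setup might look like the following. The variable names are assumptions based on the keys mentioned above, so match them to whatever names app.py actually reads:

```shell
# .env — keep this file out of version control
GROQ_API_KEY=your_groq_api_key_here
GOOGLE_API_KEY=your_google_api_key_here
```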

  4. Run the Streamlit Application:

streamlit run app.py

License

GNU General Public License v3.0 (GPLv3)

Follow Me

LinkedIn

Author

  • If you like my work and it helped you in any way, please ⭐ the repository; it will motivate me to make more projects like this.