AI & Soup Hybrid Webscraper for News Extraction
The Hybride_AI_Soup_Webscraper is a tool for extracting news articles from a wide range of news portals. Its distinguishing feature is a hybrid approach: it combines the traditional CSS-selector method with the text-processing capabilities of a Large Language Model (LLM) to handle one of the biggest challenges in web scraping, namely dynamic and varying HTML structures.
In this project, I have implemented a dual-method web scraper that pairs CSS-selector parsing with LLM-based extraction.
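Most traditional web scrapers struggle with dynamic content and with HTML structures that change constantly and differ from one news website to the next. Combining CSS selectors with an LLM addresses this by letting the LLM's text processing handle page structures that fixed selectors cannot anticipate.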
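One way to picture the hybrid idea is a selector-first pass with an LLM fallback. The sketch below is only an illustration under that assumption and is not necessarily how this project wires the two together. The names `extract_with_selectors`, `extract_hybrid`, `selector_extractor`, and `llm_extractor` are hypothetical, and the LLM step is passed in as a plain callable so that no particular API is implied.

```python
from typing import Callable, Dict, List

Article = Dict[str, str]


def extract_with_selectors(html: str, selector: str) -> List[Article]:
    """Return title/link pairs for every element matching a CSS selector."""
    # Imported here so the fallback logic below can be read and exercised
    # without BeautifulSoup installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [
        {"title": tag.get_text(strip=True), "url": tag.get("href", "")}
        for tag in soup.select(selector)
    ]


def extract_hybrid(
    html: str,
    selector_extractor: Callable[[str], List[Article]],
    llm_extractor: Callable[[str], List[Article]],
) -> List[Article]:
    """Try the selector path first; fall back to the LLM if it finds nothing."""
    articles = selector_extractor(html)
    if articles:
        return articles
    # The selectors matched nothing, which suggests the page layout differs
    # from what they expect, so the page is handed to the LLM step instead.
    return llm_extractor(html)
```

A caller might pass `lambda html: extract_with_selectors(html, "article h2 a")` as the selector step, where the selector string is only an example. Trying selectors first keeps the cheaper path as the default and reserves LLM calls for pages the selectors cannot parse.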
find_all_pagination_urls()
get_and_clean_html()
extract_news_articles_with_chatgpt()
flatten_news()

pip install beautifulsoup4 undetected-chromedriver openai requests
export OPENAI_API_KEY='your-api-key'
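Since the key is supplied through the environment, a script can check for it up front and fail with a clear message instead of failing later inside an API call. This is a minimal sketch; `load_openai_api_key` is a hypothetical helper and not part of this project.

```python
import os


def load_openai_api_key() -> str:
    """Return the key from the environment, or fail with a clear message."""
    key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the scraper."
        )
    return key
```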
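import json

# Placeholder: replace with the base URL(s) of the news portals to scrape.
base_url = {List_base_URL}

scrapper = NewsScrapperGeneral(base_url)
scrapper.find_all_pagination_urls()              # discover paginated listing pages
scrapper.get_and_clean_html()                    # fetch each page and clean its HTML
scrapper.extract_news_articles_with_chatgpt()    # extract articles with the LLM
scrapper.flatten_news()                          # flatten the extracted results

print(json.dumps(scrapper.webpages, indent=4))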
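To illustrate what a flattening step can look like, the sketch below merges per-page article lists into a single list. It assumes, purely for illustration, that results are stored as a dictionary mapping each page URL to a list of article dictionaries; the actual layout of `webpages` and the behavior of `flatten_news()` may differ. `flatten_pages` and the `source_page` key are hypothetical names.

```python
from typing import Any, Dict, List


def flatten_pages(pages: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Merge per-page article lists into one list, recording each source page."""
    flat: List[Dict[str, Any]] = []
    for page_url, articles in pages.items():
        for article in articles:
            # Build a new dict so the input structure is left untouched.
            flat.append({**article, "source_page": page_url})
    return flat
```

Keeping the source page on each record means the flat list can still be traced back to the listing page an article came from.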
This tool provides a robust, adaptable, and intelligent solution for scraping news articles from a variety of websites. By combining traditional web scraping techniques with modern AI capabilities, the Hybride_AI_Soup_Webscraper overcomes many of the limitations faced by conventional scrapers, particularly when dealing with dynamic or changing HTML structures.