Model: TheBloke/Llama-2-13b-Chat-Q4_K_M_GGUF (4-bit quantization; a medium-sized variant chosen to balance output quality against compute cost)
BeautifulSoup - for HTML content parsing
LangChain - for inference pipeline and output parsing
## Approach
1. Extract the content of the webpage with LangChain's BSHTMLLoader.
2. Extract the hrefs (URLs) from the HTML using BeautifulSoup.
3. Get a list of the companies mentioned in the article via prompt engineering.
4. Get a JSON mapping from each company to its domain via prompt engineering, passing in the hrefs extracted in step 2 and the list of companies from step 3.
5. Get the topic of the article as a JSON structure via prompt engineering.
6. Combine the company JSON and the topic JSON into one object and parse it as a Pydantic BaseModel.
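
The href extraction step can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the function name `extract_hrefs`, the choice to keep only absolute `http(s)` links and drop duplicates, and the sample HTML (including its URL paths) are all assumptions made for the example.

```python
from bs4 import BeautifulSoup


def extract_hrefs(html: str) -> list[str]:
    """Return absolute http(s) link targets in document order, without duplicates.

    Hypothetical helper: the filtering and de-duplication rules are assumptions.
    """
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    hrefs = []
    # href=True skips <a> tags that have no href attribute at all
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"].strip()
        # Drop fragment-only and relative links; they carry no domain information
        if href.startswith(("http://", "https://")) and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


# Made-up sample page for illustration only
sample_html = """
<article>
  <p><a href="https://x.com/premium">X</a> is adding tiers, according to
  <a href="https://www.bloomberg.com/news">Bloomberg</a>.</p>
  <a href="#comments">Comments</a>
  <a href="https://x.com/premium">X again</a>
</article>
"""
print(extract_hrefs(sample_html))
```

Keeping the list short and free of duplicates matters here because the hrefs are passed into the prompt for the company-to-domain mapping, and a 13B model has a limited context window.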
## Results
{"related_companies":[{"company_name":"X","company_domain":"x.com"},{"company_name":"Bloomberg","company_domain":"bloomberg.com"}],"topic":"X is launching two new subscription tiers, including a ‘Premium+’ ad-free plan"}
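
For reference, a Pydantic model matching the output above might look like the following sketch. The class names `Company` and `ArticleInfo`, the helper `combine`, and the shape of the two intermediate JSON strings are assumptions; only the field names and values are taken from the result shown.

```python
import json
from typing import List

from pydantic import BaseModel


class Company(BaseModel):
    company_name: str
    company_domain: str


class ArticleInfo(BaseModel):
    related_companies: List[Company]
    topic: str


def combine(company_json: str, topic_json: str) -> ArticleInfo:
    """Merge the two JSON strings and validate the result (hypothetical helper)."""
    merged = {**json.loads(company_json), **json.loads(topic_json)}
    # Pydantic raises a ValidationError if a field is missing or has the wrong type
    return ArticleInfo(**merged)


# Assumed shape of the two model outputs before they are combined
company_json = (
    '{"related_companies": ['
    '{"company_name": "X", "company_domain": "x.com"}, '
    '{"company_name": "Bloomberg", "company_domain": "bloomberg.com"}]}'
)
topic_json = (
    '{"topic": "X is launching two new subscription tiers, '
    'including a ‘Premium+’ ad-free plan"}'
)

info = combine(company_json, topic_json)
print(info.related_companies[0].company_domain)
print(info.topic)
```

Validating through a model like this catches malformed LLM output early, for example a response that omits `topic` or returns `related_companies` as a plain string.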