Playing with Pre-trained LLM Networks

While my role is predominantly in the mid-to-senior technology leadership space, the market consistently expects, and requests, that managers have hands-on experience and can function as individual contributors. On a personal note, I also like to understand the difficulties and tasks my teams face, so I am eager to keep learning myself.
My constant journey to upskill and stay current has me playing about with AI at the moment. In this post I am going to summarize some of my experiences and findings with pre-trained LLM networks.
The best way to learn anything is to have an objective: define a project for yourself and work towards that goal - keep the project simple, the objective is to learn new technologies, not solve the world's problems, yet.
Starting a Project
Define the requirements - what are the inputs and outputs of this project, and what are you trying to achieve? Write these down so you have your goal and can reference it as you progress. You don't want to lose sight of the task and end up with scope creep and some behemoth solution - let's stick to an MVP.
Choose the Right Tools and Technologies
Analyze what you are trying to achieve and identify the problems you need to solve: OCR, screen scraping, image recognition, etc. Take some time to research the tools and technologies available to you to accomplish your goals - leverage ChatGPT, Reddit, Google, your network, etc. to compare and contrast your choices, and to provide insight on the factors that may impact your solution: cost, scalability, hosting, open-source, security/privacy.
Build Your MVP
What is the minimum viable product, and if you were to break it into manageable tasks, components, and processes, what would they be: web scraping, document parsing, image generation, document creation, etc.? Start to build out these individual pieces, with the intention of uniting them into the final solution.
Test, Test, Test... Test Again
Validate and verify your solution, then iterate to improve accuracy and efficiency. Find shortfalls and areas of improvement, and work out your next set of features.
Deploy
Host and allow access to others, gain feedback, evolve.
Starting My AI Project
I have decided to create a tool to evaluate a company website and report on the results - fundamentally, a simple script to scrape a website and summarize its pages for me. Again, we are not trying to solve world peace, just gain real-world experience with creating an AI application.
First Things First - Setting Up the Development Environment
I needed to install Python, as this will be the main development language for my project. I am using Cursor for my IDE - it's a VS Code fork, completely free with some basic AI assistance - but use what you are most comfortable with. In addition to installing Python, it is recommended to create a virtual environment (venv) for your project so that dependencies do not impact the global Python installation. A virtual environment allows you to have various projects using the same package but different versions; you don't want another project you are developing to impact this one, or vice versa.
Install virtualenv (optional - Python 3 also ships with the built-in venv module, which is what the command below uses)
pip install virtualenv
Create a project folder where you are going to work. Within this folder (which I am calling ai-project-environment) you are going to create a virtual environment, which will contain a Python interpreter and a copy of the pip package manager, using the following command:
python -m venv ai-project-environment
Activate this virtual environment:
ai-project-environment\Scripts\activate
Now when you install any dependencies they will be installed only in this environment. In my example I am installing requests, beautifulsoup4, transformers, and torch. If you have a subscription to OpenAI you might want to check out installing and using the openai package (there is a short sketch of that route just after the install commands below). I am going to play with Hugging Face and their freely available pre-trained models:
pip install requests beautifulsoup4 transformers
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cpu
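As an aside, if you did take the OpenAI route instead of a locally hosted model, the equivalent call would look roughly like this. This is a sketch only - it assumes you have installed the openai package and exported an API key, and gpt-4o-mini is just a placeholder model name:
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name, use whichever you have access to
    messages=[{"role": "user", "content": "Summarize the following text: ..."}],
)
print(response.choices[0].message.content)
The rest of this post sticks with the free, locally run Hugging Face models.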
What Are These Packages?
- Requests is a user-friendly HTTP library for Python that simplifies the process of interacting with web services. It allows developers to send HTTP requests effortlessly, handling tasks like query strings, POST data, and file uploads. Requests abstracts away the complexities of making HTTP requests, enabling easy integration with RESTful APIs and web scraping tasks.
- BeautifulSoup4 is a powerful Python library for parsing HTML and XML documents. It creates a parse tree from page source code that can be used to extract data easily. BeautifulSoup4 excels at navigating, searching, and modifying the parse tree, making it an essential tool for web scraping projects.
- The Transformers library, developed by Hugging Face, is a comprehensive toolkit for working with state-of-the-art natural language processing models. It provides easy access to pre-trained models like BERT, GPT, and T5, along with their associated tokenizers. Transformers offers a unified API for various NLP tasks such as text classification, named entity recognition, and text generation. Its pipeline feature simplifies the use of complex models.
- PyTorch is an open-source machine learning library developed by Facebook's AI Research lab. It provides a flexible and intuitive platform for building and training neural networks. It offers robust support for GPU acceleration, distributed training, and a rich ecosystem of tools and libraries. PyTorch's seamless integration with Python allows for natural coding workflows in deep learning projects.
All these packages form a powerful toolkit for various data science and AI tasks, from web data collection to advanced natural language processing and deep learning model development.
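Once everything is installed, a quick sanity check is to load a small pre-trained model and summarize a throwaway sentence or two. This is just a sketch using the philschmid/flan-t5-base-samsum model that the script later relies on - the sample text is arbitrary:
from transformers import pipeline

# Load the same model the main script uses and summarize a short piece of text
summarizer = pipeline("text2text-generation", model="philschmid/flan-t5-base-samsum")
sample = ("The quarterly meeting covered hiring plans, the new product roadmap, "
          "and a review of last year's customer feedback survey results.")
print(summarizer(sample, max_length=40, min_length=10)[0]['generated_text'])
If this runs (the first call will download the model weights), your environment is ready.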
Saving Your Development Environment
As an aside, once you are finished installing packages for your virtual environment, you can deactivate it and return to your global environment.
deactivate
To return to your ai-project-environment you need to execute activate just like you originally did:
ai-project-environment\Scripts\activate
You can also store your dependencies for your project in a requirements file, so that it's easier to share and reinstall dependencies if required.
pip freeze > requirements.txt
Then you can run a simple command to install/reinstall the requirements, so if you wanted to hand this environment off to someone to build their own app, you can just provide your requirements.txt and they can execute the following:
pip install -r requirements.txt
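For reference, the generated requirements.txt is just a list of pinned packages. With the installs above it would look something like the following - the version numbers shown here are purely illustrative and will differ depending on when you install:
beautifulsoup4==4.12.3
requests==2.32.3
torch==2.3.1
transformers==4.41.2
pip freeze will also record the transitive dependencies these packages pull in, which is expected.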
Let's Code
I am going to focus on discussing the AI components in the scripts, but I will certainly comment on and explain the basics. I am splitting the code up into sections to review. The following code sets up our basic variables and imports the appropriate functions for use within the script.
import requests
import os
import logging
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
from transformers import BartTokenizer, pipeline, T5Tokenizer  # T5Tokenizer for philschmid/flan-t5-base-samsum, BartTokenizer for facebook/bart-large-cnn
from datetime import datetime

# Function to remove a file if it exists, used to remove the logging and summary outputs on each execution
def remove_file_if_exists(file_path):
    if os.path.exists(file_path):
        os.remove(file_path)
        print(f"Removed existing file: {file_path}")

# File paths for summary and debug output
debug_log_path = 'webscraper_debug.log'
summaries_path = 'summaries.txt'

# Remove existing files at the start of the script
remove_file_if_exists(debug_log_path)
remove_file_if_exists(summaries_path)

# Set up logging location and level
logging.basicConfig(filename=debug_log_path, level=logging.WARN, format='%(asctime)s - %(levelname)s - %(message)s')

# Summary output file path
file_path = summaries_path

# URL of the website to scrape
base_url = "https://**siteyouwishtoscrape**"
Now that we have the basics set up, we are going to move on to determining all the links relative to the domain we specified and scraping the text from them. While completing this process we make sure we don't pull in any duplicates, and we also remove links containing the # character, as these are typically sub-references - we are just looking at the main content within this simple project. Finally, we remove lines from the scraped text that are fewer than 7 words long; this is just a simple way to avoid labels or short statements that add no content to a summary:
# Set to track processed links and avoid duplicates
processed_links = set()

# Dictionary to store text from each page
scraped_texts = {}

# Function to check if a link belongs to the main domain
def is_same_domain(url, base_url):
    base_domain = urlparse(base_url).netloc
    link_domain = urlparse(url).netloc
    return base_domain == link_domain

# Function to normalize URLs by removing trailing slashes
def normalize_url(url):
    return url.rstrip('/')

# Function to scrape the website recursively from a set of provided links
def scrape_website(base_url, processed_links, scraped_texts):
    response = requests.get(base_url)
    if response.status_code == 200:
        # Ensure the correct encoding is detected
        response.encoding = response.apparent_encoding
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the text content of the current page
        page_text = soup.get_text(separator='\n', strip=True)
        # Filter out short lines (fewer than 7 words)
        filtered_lines = [line for line in page_text.split('\n') if len(line.split()) >= 7]
        filtered_text = '\n'.join(filtered_lines)
        # Record the title for the page
        page_title = soup.title.string if soup.title else "No Title Found"
        scraped_texts[base_url] = (filtered_text, page_title)
        # Find all links on the current page
        links = soup.find_all('a', href=True)
        for link in links:
            href = link['href']
            # Resolve relative links and normalize them to avoid duplicates
            full_url = urljoin(base_url, href)
            normalized_url = normalize_url(full_url)
            # Skip links that contain a fragment identifier (#)
            if '#' in normalized_url:
                continue
            # Check if the link belongs to the same domain and has not been processed
            if normalized_url not in processed_links and is_same_domain(normalized_url, base_url):
                processed_links.add(normalized_url)
                # print(f"Scraping link: {normalized_url}")
                # Recursively scrape the next page
                scrape_website(normalized_url, processed_links, scraped_texts)
    else:
        logging.warning(f"Failed to retrieve the page. Status code: {response.status_code}")

# Send a GET request to fetch the raw HTML content
response = requests.get(base_url)
if response.status_code == 200:
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Try to extract the company name from the <title> tag so we have a reference to the company relative to the website
    title_tag = soup.find('title')
    if title_tag:
        company_name = title_tag.get_text(strip=True)
        # Print out the company name we found
        print(f"Company Webpage Title: {company_name}")
    else:
        logging.warning("Company name not found in the title tag.")
    # Scrape the website defined by base_url
    scrape_website(base_url, processed_links, scraped_texts)
else:
    logging.warning(f"Failed to retrieve the website. Status code: {response.status_code}")
Now we have taken a URL, found all the links relative to the domain, made sure there are no duplicates, and scraped and stored the relevant text from each link. The next step is to take the scraped text and present it to our pre-trained model of choice. I will be skipping scraped content that has little to no value - content below a certain length. When presenting data to a pre-trained model you are going to learn that the token indices sequence length is important; I will be evaluating this number to determine what to do with the text.
IMPORTANT: Pre-trained models will only accept a certain amount of text in a single request. This is referred to as the token sequence length of a model, so if you end up scraping text you should evaluate the token indices sequence length of the scraped text and decide what you want to do: split it and send it in chunks, truncate it, or skip it. Additionally, based on what you are doing with the text, if the token indices sequence length is less than your required summary length, you may wish to skip it.
# Minimum token indices sequence length to consider for summarization
minimum_token = 300

# Load the tokenizer associated with the model
tokenizer = T5Tokenizer.from_pretrained("philschmid/flan-t5-base-samsum")

# Max token indices sequence length for the model in use, this will be used for any improvements when scraping large content
max_input_length = 512  # Adjust this based on your model's actual limit

# Work out the token indices sequence length for each URL
for url, (text, title) in scraped_texts.items():
    tokens = tokenizer.tokenize(text)
    # print(f"Number of tokens from {url}: {len(tokens)}")

# Load a pre-trained summarization model (specify a model if needed) - I found some hallucination using "facebook/bart-large-cnn"
summarizer = pipeline("text2text-generation", model="philschmid/flan-t5-base-samsum")

# Summarize each page and write the output
for url, (text, title) in scraped_texts.items():
    if text:  # Ensure there is text available
        tokens = tokenizer.tokenize(text)
        # Only summarize if the token count is greater than or equal to the defined minimum_token length
        if len(tokens) >= minimum_token:
            summary = summarizer(text,
                                 max_length=minimum_token,
                                 min_length=200,
                                 do_sample=True,
                                 temperature=0.7,         # Adjust randomness, lower values make the output more focused
                                 # top_k=50,              # Limit to the top 50 tokens at each step
                                 # top_p=0.95,            # Nucleus sampling is used to control diversity
                                 no_repeat_ngram_size=3,  # Avoid repeating 3-grams
                                 num_beams=4,             # Beam search explores multiple possible sequences simultaneously, keeping track of the 'n' best
                                 early_stopping=True      # Used in conjunction with num_beams, stops the search once num_beams complete candidates have been generated
                                 )
            # Format a citation for the summary, referencing the site title, URL, and accessed date
            access_date = datetime.now().strftime("%B %d, %Y")
            citation_format = f"\n\nCitation: \"{title}\". Retrieved from {url} on {access_date}.\n"
            # Log the original text that was summarized
            logging.warning(f"The original text retrieved from {url}:\n" + text + "\n")
            # Append the AI summary to the output file
            with open(file_path, 'a') as file:
                file.write(summary[0]['generated_text'] + citation_format + "\n")
            logging.info("Summary written to file.")
        else:
            logging.warning(f"Usable text scraped from {url} has {len(tokens)} tokens, less than the {minimum_token} required tokens, skipping the summarizing process.\n")
    else:
        logging.warning("No text available for summarization.")
Executing and Testing
I called my script webscraper.py, so to run the code all I need to do is execute the following:
python webscraper.py
The script executes slowly, scraping the pages and then presenting the data to the model for summarization. As noted above, the summarization will vary from one pre-trained model to another, based on the amount of content each can handle and on the parameters of the summarization task. Throughout my testing I found the facebook/bart-large-cnn model to be quite good for the amount of content it could be presented with, but I was very aware of hallucination issues - some of my page summaries contained information about contacting the Suicide Prevention line, which appears to be a known bias issue with the facebook model that hasn't been resolved.
The philschmid/flan-t5-base-samsum model performed similarly, without the obvious hallucinations... however, its maximum token indices sequence length is half the size of Facebook's model.
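If you want to verify these limits yourself rather than take my word for it, the tokenizer for each model exposes its maximum input length. A quick sketch - note that some tokenizers report a very large sentinel value when no practical limit is configured:
from transformers import AutoTokenizer

# Print the maximum input length each model's tokenizer reports
for name in ["facebook/bart-large-cnn", "philschmid/flan-t5-base-samsum"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.model_max_length)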
When tuning the output, these are the parameters you will adjust and test:
summary = summarizer(text,
                     max_length=minimum_token,
                     min_length=200,
                     do_sample=True,
                     temperature=0.7,         # Adjust randomness, lower values make the output more focused
                     # top_k=50,              # Limit to the top 50 tokens at each step
                     # top_p=0.95,            # Nucleus sampling is used to control diversity
                     no_repeat_ngram_size=3,  # Avoid repeating 3-grams
                     num_beams=4,             # Beam search explores multiple possible sequences simultaneously, keeping track of the 'n' best
                     early_stopping=True      # Used in conjunction with num_beams, stops the search once num_beams complete candidates have been generated
                     )
Specifically, I found some improvements by adjusting temperature and num_beams; this helps provide a more focused response and a better format for 'writing' the summary.
Conclusion
This is a simple project, and a good start to your journey of playing with AI. Next steps... I am certainly going to look into training my own model, but I also have some ideas floating around for some nice services using pre-trained models. Use the comments to let me know your thoughts or ideas, and let me know if you enjoyed the post and found any value in it.