LLM benchmark tool poc

Links

  1. Diagrams
  2. Codelabs Document
  3. Video of the submission
  4. Working app
  5. Github URL

About

AI powers almost every interaction online. Businesses use LLMs to deliver customized and innovative services to their customers, and LLM vendors now announce a new model every few months with promises of better performance. The most important question, however, is how reliably, accurately, and well these models actually perform.

Our project is a chat tool designed for analysts and engineers. We aim for this application to serve as a platform for exploring and interacting with LLMs and large datasets, and for providing visualizations of model performance.

Problem statement

GAIA is one of the best benchmarking datasets for agents. The questions in the dataset are categorized by level of difficulty. Using this dataset, we plan to build a platform for analysts and engineers to analyze LLM performance based on these interactions.

The available vendor is currently limited to OpenAI, and the use case assumes the validation split of the GAIA dataset.

Key components and roles

GAIA dataset: Sourced from Hugging Face; serves as the data we sample questions from.

Streamlit: Framework that powers our user interface and backend components (APIs, etc.). We chose Streamlit for its flexibility and ease of building web apps in Python.

Amazon S3: Secure and scalable storage for the metadata and other input files.

OpenAI API: Currently one of the best LLM offerings for generating responses to queries of varying difficulty and complexity.

Azure SQL Database: We use Azure's seamless integration capabilities to store AI responses and data related to those responses.

Specialized libraries: PyPDF2, pytesseract, and pandas handle extracting data from files of various formats (PDF, Excel, etc.); boto3 and pyodbc handle storage and database interactions.

Diagrams

Architecture

[Architecture diagram]

UI

Login page

[Login page screenshot]

Sign up page

[Sign up page screenshot]

User chat Interface

[Chat interface screenshot]

Visualizations

Response trend

[Response trend chart]

Response distribution by difficulty level

[Response distribution chart]

Response distribution by difficulty level

[Response distribution chart]

Setup instructions

Pre-reqs

Poetry, ODBC Driver for SQL Server (msodbcsql17), Tesseract

Steps

  1. Clone the repo

git clone git@github.com:BigData-Fall2024-TeamA3/Assignment1.git

  2. Install dependencies

poetry install

  3. Create a .streamlit/secrets.toml file with the following values (in a Streamlit app these are typically read via st.secrets; see the sketch after these steps):

OPENAI_API_KEY = ""

s3_file_key_path = "path/to/your/files/"

bucket_name = ''

s3_file_key = 'path/to/your/metadata.jsonl'

s3_file_url = ''

aws_access_key_id = ''

aws_secret_access_key = ''

  

# Azure Connection details

server = ''

database = ''

username = ''

password = ''

driver = ''

  

  4. Run the Streamlit app

poetry run streamlit run app.py
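
As a reference for step 3, a minimal sketch of how a Streamlit app typically reads these values (usage in the actual code may differ):

import streamlit as st

# Values come from .streamlit/secrets.toml
openai_api_key = st.secrets["OPENAI_API_KEY"]
bucket_name = st.secrets["bucket_name"]
s3_file_key = st.secrets["s3_file_key"]
server = st.secrets["server"]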

Engineering

Data Flow

The diagram gives an overview of the data flow within the app. The frontend and backend components are built on Streamlit. The validation dataset/metadata and the relevant files are stored in an S3 bucket; the metadata is a JSONL file that is loaded to populate the question dropdown.

        metadata_df = load_jsonl_from_s3(bucket_name, s3_file_key)
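
A minimal sketch of what this loader might look like, assuming boto3 and credentials from st.secrets (the actual implementation may differ):

import json
import boto3
import pandas as pd
import streamlit as st

def load_jsonl_from_s3(bucket_name, s3_file_key):
    """Download the JSONL metadata file from S3 and return it as a DataFrame."""
    s3 = boto3.client(
        "s3",
        aws_access_key_id=st.secrets["aws_access_key_id"],
        aws_secret_access_key=st.secrets["aws_secret_access_key"],
    )
    obj = s3.get_object(Bucket=bucket_name, Key=s3_file_key)
    lines = obj["Body"].read().decode("utf-8").splitlines()
    return pd.DataFrame([json.loads(line) for line in lines if line.strip()])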

On logging into the platform, the user selects a question from the dropdown (user input). Selecting a question triggers retrieval of the relevant file from the S3 bucket; we use AWS’s boto3 library for this.

	file_content = download_file_from_s3(bucket_name, file_path)
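
The per-question attachment download might follow the same pattern (a sketch, not the exact code):

import boto3

def download_file_from_s3(bucket_name, file_path):
    """Fetch the raw bytes of the attachment referenced by the selected question."""
    s3 = boto3.client("s3")  # credentials resolved as in the loader above
    obj = s3.get_object(Bucket=bucket_name, Key=file_path)
    return obj["Body"].read()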

The attachments are then processed based on their type: text is extracted, tokenized, and truncated before the processed query (the constructed prompt) is sent to the OpenAI API.

processed_content = process_file_based_on_extension(file_content, file_extension)
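
A minimal sketch of how this dispatch could work; the extractor signatures below are illustrative:

def process_file_based_on_extension(file_content, file_extension):
    """Route the raw attachment bytes to the extractor suited to the file type."""
    ext = file_extension.lower()
    if ext == ".pdf":
        return extract_text_from_pdf(file_content, ext)
    if ext in (".png", ".jpg", ".jpeg"):
        return extract_text_from_image(file_content)
    if ext in (".xlsx", ".xls"):
        return extract_text_from_xlsx(file_content)
    if ext == ".py":
        return extract_text_from_python_script(file_content)
    # Fall back to treating the attachment as plain text
    return file_content.decode("utf-8", errors="ignore")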

The file-processing functions that extract the text data:

extract_text_from_pdf(file_content, file_extension)
extract_text_from_image(image_content)
extract_text_from_xlsx()
extract_text_from_python_script()
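
For example, the PDF extractor might look roughly like this (a sketch using PyPDF2; the real function may differ):

import io
from PyPDF2 import PdfReader

def extract_text_from_pdf(file_content, file_extension):
    """Read every page of the PDF and concatenate the extracted text."""
    reader = PdfReader(io.BytesIO(file_content))
    return "\n".join((page.extract_text() or "") for page in reader.pages)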

The processed data (prompt) is then sent to OpenAI for analysis.

openai_response = ask_openai(selected_question, processed_content)
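
A minimal sketch, assuming the openai-python (>=1.0) chat-completions client; the model name and prompt template are illustrative:

from openai import OpenAI
import streamlit as st

client = OpenAI(api_key=st.secrets["OPENAI_API_KEY"])

def ask_openai(selected_question, processed_content, max_tokens=3000):
    """Combine the question with the extracted file text and query the model."""
    prompt = f"Question: {selected_question}\n\nContext:\n{processed_content}"
    prompt = truncate_prompt(prompt, max_tokens)  # stay within the token budget (see Challenges below)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; the actual model is configurable
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content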

The OpenAI response is validated against the existing dataset and stored in the Azure SQL database. The output is then displayed to the user on the frontend.

insert_or_update_metadata(task_id, task_level, direct_response, annotator_response)
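
A sketch of how the upsert might be done with pyodbc against Azure SQL; the table and column names are assumptions, not the actual schema:

import pyodbc
import streamlit as st

def insert_or_update_metadata(task_id, task_level, direct_response, annotator_response):
    """Insert or update one row of response metadata in Azure SQL."""
    conn_str = (
        f"DRIVER={st.secrets['driver']};SERVER={st.secrets['server']};"
        f"DATABASE={st.secrets['database']};UID={st.secrets['username']};PWD={st.secrets['password']}"
    )
    with pyodbc.connect(conn_str) as conn:
        cursor = conn.cursor()
        cursor.execute(
            """
            MERGE metadata AS target
            USING (SELECT ? AS task_id) AS source ON target.task_id = source.task_id
            WHEN MATCHED THEN
                UPDATE SET task_level = ?, direct_response = ?, annotator_response = ?
            WHEN NOT MATCHED THEN
                INSERT (task_id, task_level, direct_response, annotator_response)
                VALUES (?, ?, ?, ?);
            """,
            (task_id, task_level, direct_response, annotator_response,
             task_id, task_level, direct_response, annotator_response),
        )
        conn.commit()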

This data is later queried from Azure for the visualizations.

fetch_data_from_azure()
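
A short sketch of the read side (same assumptions about the schema and connection as above):

import pandas as pd
import pyodbc
import streamlit as st

def fetch_data_from_azure():
    """Load the stored responses back out of Azure SQL for the visualization pages."""
    conn_str = (
        f"DRIVER={st.secrets['driver']};SERVER={st.secrets['server']};"
        f"DATABASE={st.secrets['database']};UID={st.secrets['username']};PWD={st.secrets['password']}"
    )
    with pyodbc.connect(conn_str) as conn:
        # Table and column names are illustrative
        return pd.read_sql("SELECT task_id, task_level, direct_response, annotator_response FROM metadata", conn)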

Backend architecture, APIs, and Visualizations

The frontend and backend are both handled by Streamlit. The OpenAI API and AWS are accessed through Python libraries (openai-python, boto3). Azure SQL storage uses pyodbc for database operations. Plotly is used for the visualizations that summarize our response statistics.

# example: histogram over a DataFrame column (column name illustrative)
px.histogram(data, x="task_level")
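
Put together, one of the charts could be rendered in the Streamlit app roughly like this (column names are illustrative):

import plotly.express as px
import streamlit as st

df = fetch_data_from_azure()  # stored responses, as sketched above
fig = px.histogram(df, x="task_level", title="Response distribution by difficulty level")
st.plotly_chart(fig)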

Challenges

Handling large data files: One of the bottlenecks was the token limit enforced by the OpenAI API. This is handled by tokenizing and truncating the data before sending it.

import tiktoken

# Assumes a tiktoken encoding; the exact encoding/model used is an implementation detail
tokenizer = tiktoken.get_encoding("cl100k_base")

def truncate_prompt(prompt, max_tokens):
    """Truncates the prompt to fit within the allowed token limit."""
    tokens = tokenizer.encode(prompt)

    # If prompt tokens exceed the max, truncate it
    if len(tokens) > max_tokens:
        truncated_tokens = tokens[:max_tokens]
        truncated_prompt = tokenizer.decode(truncated_tokens)
        return truncated_prompt
    return prompt

Error handling and multiple file extensions: Extracting data from various file formats is difficult. This is handled by identifying the file format and processing each file in the manner best suited to it. The pytesseract package converts image content to text, PyPDF2 extracts text from PDF files, and pandas processes Excel files. A sketch of the image path is shown below.
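
A minimal sketch of the image path, assuming pytesseract with Pillow (the actual function may differ):

import io
from PIL import Image
import pytesseract

def extract_text_from_image(image_content):
    """Run OCR over the attachment bytes and return the recognized text."""
    image = Image.open(io.BytesIO(image_content))
    return pytesseract.image_to_string(image)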

References

  1. https://docs.streamlit.io/develop
    1. https://docs.streamlit.io/develop/tutorials/multipage
    2. https://docs.streamlit.io/develop/api-reference/charts/st.plotly_chart
    3. https://docs.streamlit.io/develop/tutorials/databases/aws-s3
  2. https://learn.microsoft.com/en-us/azure/azure-sql/?view=azuresql
  3. https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-examples.html
  4. https://github.com/openai/openai-python
  5. https://docs.google.com/document/d/12x51PlTxUmD6F9uAui8ZyoWTlUt4VTFP3YCYAvrLZq4/edit?usp=sharing
  6. https://github.com/openai/openai-python/tree/main/examples
