Otel-gpt2.5-llm #347

Open · wants to merge 2 commits into base: master
3 changes: 3 additions & 0 deletions .gitignore
@@ -1,3 +1,6 @@
__pycache__
.mypy_cache/
models/
logs/
gpt2-finetuned/
gpt2-java-instrumentation/
156 changes: 123 additions & 33 deletions README.md
@@ -1,58 +1,148 @@
# OpenTelemetry Code Generator with GPT-2

This project fine-tunes a GPT-2 model to automatically generate OpenTelemetry instrumentation for Java code. The model is trained on a dataset of Java code snippets, and the fine-tuned model is capable of suggesting how to instrument Java methods/classes with OpenTelemetry.

## Table of Contents

- [Project Overview](#project-overview)
- [Setup and Installation](#setup-and-installation)
- [Preparing the Dataset](#preparing-the-dataset)
- [Training the Model](#training-the-model)
- [Generating Instrumented Code](#generating-instrumented-code)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)

## Project Overview

This project involves fine-tuning a GPT-2 model to automatically suggest OpenTelemetry instrumentation code for Java applications. The model is trained on a dataset containing pairs of non-instrumented and instrumented Java code snippets.

## Setup and Installation

### 1. Clone the Repository

```bash
git clone https://github.com/yourusername/gpt2-opentelemetry-codegen.git
cd gpt2-opentelemetry-codegen
```

### 2. Create a Virtual Environment

Create and activate a virtual environment using `venv`:

```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```

### 3. Install Dependencies

Install the required Python libraries:

```bash
pip install -r requirements.txt
```

The `requirements.txt` file should include the following packages:

```plaintext
transformers==4.30.2
datasets==2.10.1
torch==2.0.1
tensorflow==2.12.0 # Optional, if you want to use TensorFlow
```

### 4. Download the Pre-trained GPT-2 Model

The pre-trained GPT-2 model will be automatically downloaded from Hugging Face when you first run the script. No additional steps are required for this.
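To confirm the download works on its own, a quick sanity check (assuming the Hugging Face `transformers` package from `requirements.txt`) is:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# The first call downloads and caches the pre-trained weights from the Hugging Face Hub.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(model.config.n_layer, model.config.n_embd)  # 12 transformer blocks, 768-dim embeddings for base GPT-2
```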

## Preparing the Dataset

Create a dataset file named `java_code_dataset.txt` containing pairs of non-instrumented and instrumented Java code snippets. Here is an example format:

```plaintext
#### Non-Instrumented Code Example 1:
public class ExampleService {
    public void process() {
        System.out.println("Processing data...");
    }
}

#### Instrumented Code Example 1:
// Instrumentation using OpenTelemetry
public class ExampleService {
    public void process() {
        // Start tracing
        Span span = tracer.spanBuilder("process").startSpan();
        try {
            // Original logic
            System.out.println("Processing data...");
        } finally {
            // End tracing
            span.end();
        }
    }
}
```

Ensure that the file is properly formatted and saved in the root directory of the project.
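To verify the file loads as expected before training, a quick check (assuming the Hugging Face `datasets` library used elsewhere in this repository) is:

```python
from datasets import load_dataset

# Each line of the text file becomes one example under the "text" field.
dataset = load_dataset('text', data_files={'train': 'java_code_dataset.txt'})
print(dataset['train'][0])  # Inspect the first example
```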

## Training the Model

### 1. Set Up Training Script

The provided script `otel_code_gen_gpt2.py` handles the fine-tuning process.

### 2. Train the Model

To fine-tune the model on your dataset, run:

```bash
python otel_code_gen_gpt2.py
```

This will (see the sketch after this list):

1. Load the pre-trained GPT-2 model and tokenizer.
2. Tokenize your dataset.
3. Fine-tune the model on the dataset.
4. Save the fine-tuned model to the `./gpt2-java-instrumentation` directory.
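
`otel_code_gen_gpt2.py` is not shown in this pull request; assuming it follows the same pattern as the included `fine_tune_gpt2.py`, the core of the fine-tuning step looks roughly like this sketch (the dataset file name and output directory are taken from this README):

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Load the pre-trained model and tokenizer; GPT-2 has no pad token by default.
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the instrumentation dataset and use the inputs themselves as labels.
dataset = load_dataset("text", data_files={"train": "java_code_dataset.txt"})

def tokenize(examples):
    enc = tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)
    enc["labels"] = enc["input_ids"].copy()
    return enc

tokenized = dataset.map(tokenize, batched=True)

# Fine-tune and save the model and tokenizer for later generation.
args = TrainingArguments(output_dir="./gpt2-java-instrumentation", num_train_epochs=3,
                         per_device_train_batch_size=2, logging_steps=500)
Trainer(model=model, args=args, train_dataset=tokenized["train"]).train()
model.save_pretrained("./gpt2-java-instrumentation")
tokenizer.save_pretrained("./gpt2-java-instrumentation")
```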

### 3. Monitor Training

During training, you can monitor the loss and other metrics to ensure the model is learning effectively. The final model and tokenizer will be saved for future use.

## Generating Instrumented Code

After training, you can use the fine-tuned model to generate OpenTelemetry instrumentation suggestions for Java code. Modify the prompt in the script as needed:

```python
prompt = """
// Instrument this Java class with OpenTelemetry:

public class ExampleService {
    public void process() {
        System.out.println("Processing data...");
    }
}
"""

# Generate instrumented code suggestion
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs['input_ids'], max_length=256, num_return_sequences=1)

# Decode and print the generated code
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
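
The snippet above assumes `model` and `tokenizer` already point at the fine-tuned weights; when generating in a fresh session, a minimal way to load them (assuming the `./gpt2-java-instrumentation` directory produced by training) is:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the fine-tuned model and tokenizer saved by the training step.
model = GPT2LMHeadModel.from_pretrained("./gpt2-java-instrumentation")
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-java-instrumentation")
```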

Run the script again to see the output:

```bash
python otel_code_gen_gpt2.py
```

## Troubleshooting

- **Model Overfitting**: If the model overfits (i.e., produces very low training loss but poor results), try expanding the dataset and adjusting training parameters.
- **Memory Issues**: If you run into memory issues, reduce the batch size in the `TrainingArguments`.
- **Unexpected Behavior**: If you see warnings about the attention mask or padding tokens, ensure that the tokenizer and model are correctly configured with the `pad_token` (see the sketch after this list).
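
For the padding and batch-size issues above, a minimal adjustment sketch (assuming the `transformers` `Trainer` setup used in `fine_tune_gpt2.py`; the output directory name is taken from this README) might look like:

```python
from transformers import GPT2Tokenizer, TrainingArguments

# GPT-2 ships without a pad token; reusing the EOS token silences padding warnings.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Lower the per-device batch size if training runs out of GPU memory.
training_args = TrainingArguments(
    output_dir="./gpt2-java-instrumentation",
    per_device_train_batch_size=1,
)
```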
3 changes: 3 additions & 0 deletions dataset.txt
@@ -0,0 +1,3 @@
Once upon a time in a faraway land, there was a kingdom.
The king was known for his kindness and wisdom.
Every day, the villagers would gather to hear the king's advice.
62 changes: 62 additions & 0 deletions fine_tune_gpt2.py
@@ -0,0 +1,62 @@
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Step 1: Load the GPT-2 model and tokenizer
model_name = "gpt2" # You can choose from 'gpt2', 'gpt2-medium', 'gpt2-large', etc.
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Step 2: Set the pad_token to eos_token or add a custom padding token
tokenizer.pad_token = tokenizer.eos_token
# tokenizer.add_special_tokens({'pad_token': '[PAD]'}) # Alternative option

# Step 3: Load and prepare your dataset
# Replace 'dataset.txt' with the path to your dataset
dataset = load_dataset('text', data_files={'train': 'dataset.txt'})

# Tokenize the dataset and set labels
def tokenize_function(examples):
    inputs = tokenizer(examples['text'], truncation=True, padding='max_length', max_length=128)
    inputs["labels"] = inputs["input_ids"].copy()  # Set labels to be the same as input_ids
    return inputs

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Step 4: Set up training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,              # Adjust the number of epochs as needed
    per_device_train_batch_size=2,   # Adjust batch size according to your GPU's capability
    save_steps=10_000,
    save_total_limit=2,
    logging_dir='./logs',
    logging_steps=500,
)

# Step 5: Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
)

# Step 6: Fine-tune the model
trainer.train()

# Step 7: Save the fine-tuned model
trainer.save_model("./gpt2-finetuned")
tokenizer.save_pretrained("./gpt2-finetuned")

# Step 8: Generate text using the fine-tuned model
# Load the fine-tuned model and tokenizer
model = GPT2LMHeadModel.from_pretrained("./gpt2-finetuned")
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-finetuned")

# Generate text with the fine-tuned model
prompt = "In a distant future,"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs['input_ids'], max_length=100, num_return_sequences=1)

# Decode and print the generated text
print(tokenizer.decode(outputs[0], skip_special_tokens=True))