allow adding header and footer, autoconfirm, add js-tiktoken, add faster check for going to file-only mode #343

Open. elee1766 wants to merge 10 commits into base: dev.
Conversation

elee1766

Lots of changes in this PR that I've been making for myself as I've been using the tool.

I don't really think it's ready to merge, but thought I would put it here.

It also fixes the punycode warning.

refactor: remove unused fs operations from esbuild.config.js
chore: update dependencies and add overrides in package.json
docs: update commit message guidelines in prompts.ts
refactor: switch to js-tiktoken for token encoding in tokenCount.ts
chore(config): add new config keys for prompt header and footer

style: reformat code for better readability and consistency across files

refactor(prompts.ts): restructure prompt generation for clarity and maintainability

chore: update dependencies and improve environment variable handling

style(server.ts): rename port variable to uppercase for consistency and clarity

chore(server.ts): add support for configurable port via process.env.PORT

style(engine.ts, tokenCount.ts, version.ts): standardize import quotes to single quotes for consistency
… directory

chore(package.json): add tiktoken dependency

style(config.ts, prompts.ts): fix formatting and indentation issues

refactor(tokenCount.ts): add native token counting using Tiktoken and conditional export based on environment variable
chore(commit.ts): add check for empty remotes array before selecting remote

perf(tokenCount.ts): move encoding initialization inside function to reduce memory usage
Introduce OCO_GIT_STAGE_ALWAYS config key to bypass user confirmation
for staging all files. This enhances automation and streamlines the
commit process when the config is set to true.
refactor(commit.ts): add auto-confirmation for commit and stage actions

refactor(config.ts): rename OCO_GIT_STAGE_ALWAYS to OCO_AUTOCONFIRM_STAGE and add OCO_AUTOCONFIRM_COMMIT

fix(openAi.ts): handle undefined message content in token count

style(prompts.ts): move OCO_PROMPT_FOOTER to the end of the prompt

style(mergeDiffs.ts): adjust import statement and spacing in token count check
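
For context, the js-tiktoken switch with the encoding created inside the function (as the perf commit above describes) might look roughly like the sketch below; this is illustrative, not the branch's actual tokenCount.ts.

```typescript
// Sketch of token counting via js-tiktoken, creating the encoder on demand
// inside the function instead of at module load.
import { getEncoding } from 'js-tiktoken';

export function tokenCount(text: string): number {
  const encoder = getEncoding('cl100k_base'); // encoding used by GPT-3.5/4 models
  return encoder.encode(text).length;
}
```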
@di-sukharev (Owner)

If you want this to be merged at some point, please describe the changes; this will help me review the code.

@elee1766 (Author) commented May 24, 2024

I have some more changes and cleanup I would want to make, as I added the features just to make them work for me, and I wouldn't want to merge half-baked hacks into your repo.

If I find some time to make this merge-ready, I will probably redo much of it, because I am not happy with the hacky solutions I came up with. Regardless, I'll still detail what I did in case it might be useful for your own development.

FWIW, you can fix the punycode warning by adding this version override in package.json. This is probably the most annoying thing that can be fixed for a normal user:

  "overrides": {
    "whatwg-url": "13.0.0"
  },
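
(One note, as an assumption not stated in the thread: an npm override only takes effect once dependencies are reinstalled, so a follow-up npm install is typically needed.)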

Otherwise, these changes solve two issues with this app for my usage:

  1. You do tokenization to get token counts a lot, and that's a rather slow process. I regularly have pretty large diffs that I want it to try to summarize, and it would lock up the process doing token counts while trying to figure out (1) whether the diff needs to be split up file by file, and (2) whether a single file itself is too big.

I monkey-patch this by adding a function that estimates the token count from the character count, but I think this is a bad solution and probably shouldn't be merged as it currently stands. It is good enough for my use cases: for instance, if the diff is 20,000 characters, it most likely will not fit in a 1,000-token context (there's a sketch of this idea after this list).

  2. I need to generate semantic tags that do not use the keywords fix or feat, as these trigger my auto-CI, and I don't want the AI commits to decide when releases happen.

My hack was to add PREFIX/SUFFIX env vars to patch the prompt for the behavior I wanted (also sketched below), but I feel this is a bad solution; really, the entire prompt builder and the configuration around prompt generation need to be redone for me to feel happy about merging. I'm happy with my hack for personal use, but I really don't think it is something people should use. The way the prompt strings are created in the existing library is a mess: the repeated code across the two different prompt types really bothers me, along with the lack of customization around the prompt in general.
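
For the first issue, a minimal sketch of the character-count heuristic could look like the following; the ~4 characters-per-token ratio and the function names are assumptions for illustration, not necessarily what this branch does.

```typescript
// Minimal sketch of the character-count heuristic, assuming roughly
// 4 characters per token for typical English/code text. Names here are
// illustrative, not the actual tokenCount.ts from this branch.
const APPROX_CHARS_PER_TOKEN = 4;

function estimateTokenCount(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

// Cheap pre-check: only fall back to exact tokenization (e.g. js-tiktoken)
// when the estimate is anywhere near the limit. The 2x margin is an
// arbitrary safety factor for how far the estimate might be off.
function clearlyExceedsLimit(text: string, maxTokens: number): boolean {
  return estimateTokenCount(text) > maxTokens * 2;
}
```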
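
And for the second issue, the OCO_PROMPT_HEADER / OCO_PROMPT_FOOTER keys from the commit list could be spliced around the base prompt roughly like this (a simplified sketch, not the PR's actual prompt builder):

```typescript
// Simplified sketch of wrapping the generated prompt with a configurable
// header and footer; the helper name is illustrative only.
interface PromptConfig {
  OCO_PROMPT_HEADER?: string; // e.g. "never use the commit types feat or fix"
  OCO_PROMPT_FOOTER?: string; // appended after the base instructions
}

function wrapPrompt(basePrompt: string, config: PromptConfig): string {
  return [config.OCO_PROMPT_HEADER, basePrompt, config.OCO_PROMPT_FOOTER]
    .filter((part): part is string => Boolean(part))
    .join('\n\n');
}
```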

di-sukharev changed the base branch from master to dev on July 4, 2024 08:04
@di-sukharev (Owner)

@elee1766 hello, thanks for your contribution! Could you please resolve the conflicts?

To resolve the /out folder conflicts you can just run npm run build.

@di-sukharev (Owner)

@elee1766 hello hello :)

Do you plan to finish the PR? The conflicts need to be resolved.

@elee1766 (Author)

@di-sukharev tbh, not really.

I tried using multiple models for a while, looking for one that would be good and fast enough for work, but the results were really lackluster.

Maybe when the models improve and I feel encouraged to try again, I may revisit this.

@di-sukharev (Owner)

@elee1766 have you tried Groq? I honestly think we can make models perform much better by improving the few-shot examples we use before generating the commit.

What exactly would you need to see to consider it an improvement?

@elee1766 (Author) commented Aug 18, 2024

> @elee1766 have you tried Groq? I honestly think we can make models perform much better by improving the few-shot examples we use before generating the commit.
>
> What exactly would you need to see to consider it an improvement?

I have not tried Groq.

Re: improvement, it has to be correct. AI struggles a lot at even the most basic PRs. Even when generating a top 3 or top 5, I often failed to find a satisfactory commit message.

To start, the only model that could produce results anywhere close to correct was GPT-4o. I could not get accurate messages out of any other model; they simply lie or are completely wrong, almost every time. I tried every OpenAI model and a few local models as well on my 4090.

That said, GPT-4o was still barely correct. It was largely the same as the other models, but it sometimes would not get tripped up where others would, while never performing much better.

For small commits, the AI is incredibly inconsistent. It mistakes things like refactoring and code cleanup for business logic changes. My simplest test was removing a single blank line in the middle of a large code base; the commit message should explain that it's deleting a single newline, as a refactor. For some reason this is not a task that the AI can consistently do. What's funny is that the context of where you deleted the empty line really makes a difference; the AI can get really confused. GPT-4o would get it right only around two-thirds of the time, with other models faring worse, around 25-50%.

For bigger commits, and by that I mean ~4-5 modified files and a few hundred lines, the AI barely functions, even on tasks that I deemed should be easy for it to figure out: fully rewriting components without changing interfaces, fixing logic bugs, refactoring components by renaming variables, moving Go packages from directory to directory. Sometimes, through a stroke of luck, I can get an okay commit message, but it usually took a few tries, and I just wrote the message myself after the second attempt at getting the AI to produce something correct. Note that each try was 3 prompts.

Finally, for larger commits or squashing multiple commits into one PR, say 10+ files and 1k+ lines, up to around 50 files and 10k lines, it's useless. These commits are normal in my workflow and include multiple things. The AI gets overwhelmed, takes forever, cannot consume all of the diff as context, and has never produced a satisfactory commit message that details all the different things I've changed. It either produces complete fabrications of what happened in the changes, or only details a single part of what I've changed and stops there.

Besides the AI not functioning for large commits, the biggest issue with all of this is definitely the inconsistency. This tool should be reducing the amount of work I need to do; that is, I shouldn't have to read the response and check it over. I should be confident that the AI will produce a good commit message and push without thinking twice about it, moving on to the rest of my work. If I have to babysit the results, then I may as well just write my own commit message. To me, 99% accuracy might be enough to trust, but right now it's more like 30%.

I personally think the git diff format is perhaps not the correct input to be passing to the AI, but I'm not sure what the correct input is. The AI is also just missing holistic context of the code base, something the git diff alone can never provide. The git diff is also hard for the AI to parse, since it has to understand the removed, added, unchanged, and modified line prefix markers.

For something like this to work, maybe some pre-built index of repository info describing what each existing line does needs to be generated first and saved per repository, so the relevant context (file names and lines) can be fed to the AI along with a reduced diff format that says more explicitly what change has happened. Then maybe the AI would have enough context to generate a proper commit.
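
Purely to illustrate that idea (nothing like this exists in the tool today, and all names here are hypothetical), the per-repository index and the reduced diff format could be shaped something like:

```typescript
// Hypothetical data shapes for a per-repository context index and a
// "reduced diff" that states changes explicitly instead of using +/- markers.
interface LineRangeSummary {
  file: string;
  startLine: number;
  endLine: number;
  summary: string; // what this block of code is responsible for
}

interface RepoIndex {
  commit: string;              // commit the index was built against
  entries: LineRangeSummary[];
}

interface ReducedChange {
  file: string;
  kind: 'added' | 'removed' | 'modified';
  before?: string; // code as it was, omitted for pure additions
  after?: string;  // code as it is now, omitted for pure removals
}
```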

But I don't know: even top-tier AI code assistants are currently unable to understand and do work across multiple files in a code base at a time in any meaningful capacity, which is a task that oco needs to do. I would assume that if they haven't figured out how to properly index a repository, we likely won't any time soon either.

Either models need to understand the git diff format better and also handle 100k+ symbols quickly, or something about the process for generating the commit needs to change fundamentally.

@di-sukharev (Owner)

@elee1766 first — thank you very much for sharing your experience, it's a great opportunity to improve, thank you!

I also believe the current implementation is not enough, and what we should at least do is:

  1. keep (or generate on the fly) implementation context of what a file is doing
  2. fetch all changed files one by one and ask an LLM to reason about each change in a file
  3. generate the commit message

But I'm still not sure about the performance; I'm almost sure it won't be 99% accurate, models are still "stupid".

What do you think about something like this pipeline:

  1. you get a ticket in Jira, let's say it's a feature request: add Groq AI provider support, ticket #123
  2. you run something like oco branch "ticket 123 add groq support <and some more ticket context>"
  3. oco creates a semantic branch name for you, e.g. feature-123_add_groq_ai_provider_support
  4. now we know the context of your changes and can better guess why you are making them
  5. when you ask to generate a commit, we fetch the whole file, then ~20 lines above and ~20 lines below your changes, combine that with the git diff, and ask an LLM to reason about it all together (sketched after this list)
  6. only then do we ask an LLM to generate the commit message
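
A rough sketch of what step 5 could look like, with hypothetical types and helper names (this is not existing oco code):

```typescript
// Pull the changed file plus ~20 lines of surrounding context for each hunk
// and bundle it with the raw diff so the LLM can reason about both.
import { readFileSync } from 'node:fs';

interface ChangedHunk {
  file: string;      // path of the modified file
  startLine: number; // first line of the hunk in the new file (1-based)
  endLine: number;   // last line of the hunk in the new file
}

const CONTEXT_LINES = 20;

function buildReasoningInput(hunks: ChangedHunk[], rawDiff: string): string {
  const sections = hunks.map((hunk) => {
    const lines = readFileSync(hunk.file, 'utf8').split('\n');
    const from = Math.max(0, hunk.startLine - 1 - CONTEXT_LINES);
    const to = Math.min(lines.length, hunk.endLine + CONTEXT_LINES);
    const context = lines.slice(from, to).join('\n');
    return `File: ${hunk.file} (lines ${from + 1}-${to})\n${context}`;
  });

  return [
    'Surrounding context for each change:',
    ...sections,
    'Raw diff:',
    rawDiff
  ].join('\n\n');
}
```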

@elee1766 (Author)

Yeah, I've been thinking of putting something along these lines into our PR workflow in our GitLab; that way you could perhaps build some chain of context. The added bonus is you get to make a super PR commit message at the end.

I wonder what the best way is to have the LLM properly understand the git diff format. Many times I noticed it getting confused, not knowing what the before/after of the commit is. Perhaps having it process the difference between the before and after code instead of the diff format might, counter-intuitively, do better.

@di-sukharev (Owner)

@elee1766 exactly, this is 100% worth testing. I believe we would get better results if we ask it to generate the commit from human-readable changes instead of the weird +, @@, and - markers:

diff --git a/out/cli.cjs b/out/cli.cjs
index cf62280..cae8864 100755
--- a/out/cli.cjs
+++ b/out/cli.cjs
@@ -28111,7 +28111,7 @@ function G3(t2, e3) {
 // package.json
 var package_default = {
   name: "opencommit",
-  version: "3.0.19",
+  version: "3.0.18",
   description: "Auto-generate impressive commits in 1 second. Killing lame commits with AI \u{1F92F}\u{1F52B}",
   keywords: [
     "git",
@@ -30663,7 +30663,6 @@ var TestAi = class {
 // src/commands/config.ts
 var MODEL_LIST = {
   openai: [
-    "gpt-4o-mini",
     "gpt-3.5-turbo",
     "gpt-3.5-turbo-instruct",
     "gpt-3.5-turbo-0613",
@@ -30688,6 +30687,7 @@ var MODEL_LIST = {
     "gpt-4-32k-0613",
     "gpt-4o",
     "gpt-4o-2024-05-13",
+    "gpt-4o-mini",
     "gpt-4o-mini-2024-07-18"
   ],
   anthropic: [
@@ -30731,7 +30731,7 @@ var configValidators = {
     validateConfig(
       "OpenAI API_KEY",
       value || config12.OCO_ANTHROPIC_API_KEY || config12.OCO_AI_PROVIDER.startsWith("ollama") || config12.OCO_AZURE_API_KEY || config12.OCO_AI_PROVIDER == "test" || config12.OCO_AI_PROVIDER == "flowise",
-      "You need to provide an OpenAI/Anthropic/Azure or other provider API key via `oco config set OCO_OPENAI_API_KEY=your_key`, for help refer to docs https://github.com/di-sukharev/opencommit"
+      "You need to provide an OpenAI/Anthropic/Azure API key"
     );
     validateConfig(
       "OCO_OPENAI_API_KEY" /* OCO_OPENAI_API_KEY */,
@@ -30841,7 +30841,11 @@ var configValidators = {
   ["OCO_MODEL" /* OCO_MODEL */](value, config12 = {}) {
     validateConfig(
       "OCO_MODEL" /* OCO_MODEL */,
-      typeof value === "string",
+      [
+        ...MODEL_LIST.openai,
+        ...MODEL_LIST.anthropic,
+        ...MODEL_LIST.gemini
+      ].includes(value) || config12.OCO_AI_PROVIDER == "ollama" || config12.OCO_AI_PROVIDER == "azure" || config12.OCO_AI_PROVIDER == "test" || config12.OCO_AI_PROVIDER == "flowise",
       `${value} is not supported yet, use:
 
  ${[
diff --git a/out/github-action.cjs b/out/github-action.cjs
index 61b7a71..6cf111d 100644
--- a/out/github-action.cjs
+++ b/out/github-action.cjs
@@ -49470,7 +49470,6 @@ var TestAi = class {
 // src/commands/config.ts
 var MODEL_LIST = {
   openai: [
-    "gpt-4o-mini",
     "gpt-3.5-turbo",
     "gpt-3.5-turbo-instruct",
     "gpt-3.5-turbo-0613",
@@ -49495,6 +49494,7 @@ var MODEL_LIST = {
     "gpt-4-32k-0613",
     "gpt-4o",
     "gpt-4o-2024-05-13",
+    "gpt-4o-mini",
     "gpt-4o-mini-2024-07-18"
   ],
   anthropic: [
@@ -49538,7 +49538,7 @@ var configValidators = {
     validateConfig(
       "OpenAI API_KEY",
       value || config11.OCO_ANTHROPIC_API_KEY || config11.OCO_AI_PROVIDER.startsWith("ollama") || config11.OCO_AZURE_API_KEY || config11.OCO_AI_PROVIDER == "test" || config11.OCO_AI_PROVIDER == "flowise",
-      "You need to provide an OpenAI/Anthropic/Azure or other provider API key via `oco config set OCO_OPENAI_API_KEY=your_key`, for help refer to docs https://github.com/di-sukharev/opencommit"
+      "You need to provide an OpenAI/Anthropic/Azure API key"
     );
     validateConfig(
       "OCO_OPENAI_API_KEY" /* OCO_OPENAI_API_KEY */,
@@ -49648,7 +49648,11 @@ var configValidators = {
   ["OCO_MODEL" /* OCO_MODEL */](value, config11 = {}) {
     validateConfig(
       "OCO_MODEL" /* OCO_MODEL */,
-      typeof value === "string",
+      [
+        ...MODEL_LIST.openai,
+        ...MODEL_LIST.anthropic,
+        ...MODEL_LIST.gemini
+      ].includes(value) || config11.OCO_AI_PROVIDER == "ollama" || config11.OCO_AI_PROVIDER == "azure" || config11.OCO_AI_PROVIDER == "test" || config11.OCO_AI_PROVIDER == "flowise",
       `${value} is not supported yet, use:
 
  ${[
diff --git a/package-lock.json b/package-lock.json
index 4dc2f79..d1e27d6 100644
--- a/package-lock.json
+++ b/package-lock.json
@@ -1,12 +1,12 @@
 {
   "name": "opencommit",
-  "version": "3.0.19",
+  "version": "3.0.18",
   "lockfileVersion": 3,
   "requires": true,
   "packages": {
     "": {
       "name": "opencommit",
-      "version": "3.0.19",
+      "version": "3.0.18",
       "license": "MIT",
       "dependencies": {
         "@actions/core": "^1.10.0",
diff --git a/package.json b/package.json
index c8598b5..396a2ae 100644
--- a/package.json
+++ b/package.json
@@ -1,6 +1,6 @@
 {
   "name": "opencommit",
-  "version": "3.0.19",
+  "version": "3.0.18",
   "description": "Auto-generate impressive commits in 1 second. Killing lame commits with AI 🤯🔫",
   "keywords": [
     "git",
diff --git a/src/commands/config.ts b/src/commands/config.ts
index 644273c..fe01b96 100644
--- a/src/commands/config.ts
+++ b/src/commands/config.ts
@@ -45,7 +45,6 @@ export enum CONFIG_MODES {
 
 export const MODEL_LIST = {
   openai: [
-    'gpt-4o-mini',
     'gpt-3.5-turbo',
     'gpt-3.5-turbo-instruct',
     'gpt-3.5-turbo-0613',
@@ -70,6 +69,7 @@ export const MODEL_LIST = {
     'gpt-4-32k-0613',
     'gpt-4o',
     'gpt-4o-2024-05-13',
+    'gpt-4o-mini',
     'gpt-4o-mini-2024-07-18'
   ],
 
@@ -134,7 +134,7 @@ export const configValidators = {
         config.OCO_AZURE_API_KEY ||
         config.OCO_AI_PROVIDER == 'test' ||
         config.OCO_AI_PROVIDER == 'flowise',
-      'You need to provide an OpenAI/Anthropic/Azure or other provider API key via `oco config set OCO_OPENAI_API_KEY=your_key`, for help refer to docs https://github.com/di-sukharev/opencommit'
+      'You need to provide an OpenAI/Anthropic/Azure API key'
     );
     validateConfig(
       CONFIG_KEYS.OCO_OPENAI_API_KEY,
@@ -276,7 +276,15 @@ export const configValidators = {
   [CONFIG_KEYS.OCO_MODEL](value: any, config: any = {}) {
     validateConfig(
       CONFIG_KEYS.OCO_MODEL,
-      typeof value === 'string',
+      [
+        ...MODEL_LIST.openai,
+        ...MODEL_LIST.anthropic,
+        ...MODEL_LIST.gemini
+      ].includes(value) ||
+        config.OCO_AI_PROVIDER == 'ollama' ||
+        config.OCO_AI_PROVIDER == 'azure' ||
+        config.OCO_AI_PROVIDER == 'test' ||
+        config.OCO_AI_PROVIDER == 'flowise',
       `${value} is not supported yet, use:\n\n ${[
         ...MODEL_LIST.openai,
         ...MODEL_LIST.anthropic,
@@ -451,7 +459,6 @@ export const getConfig = ({
       outro(
         `Manually fix the '.env' file or global '~/.opencommit' config file.`
       );
-
       process.exit(1);
     }
   }

We can also ask it to reason about what @@, +, and -451,7 +459,6 mean in a git diff.
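
As a starting point for that experiment, a hunk body could be mechanically rewritten into explicit before/after blocks so the model never has to interpret +, - and @@ itself; a rough sketch (parsing simplified, name hypothetical):

```typescript
// Rough sketch: turn one unified-diff hunk body into explicit
// "before" and "after" text blocks. Parsing is simplified (no handling
// of "\ No newline at end of file", etc.).
function hunkToBeforeAfter(hunkLines: string[]): { before: string; after: string } {
  const before: string[] = [];
  const after: string[] = [];

  for (const line of hunkLines) {
    if (line.startsWith('-')) {
      before.push(line.slice(1));
    } else if (line.startsWith('+')) {
      after.push(line.slice(1));
    } else {
      // context line: present on both sides
      const text = line.startsWith(' ') ? line.slice(1) : line;
      before.push(text);
      after.push(text);
    }
  }

  return { before: before.join('\n'), after: after.join('\n') };
}
```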

@di-sukharev (Owner)

I don't think this would help much though, but maybe some other smart reasoning before the commit generation would.
