Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Completed Newbie Project #292

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

ethanwang-04
Copy link

@ethanwang-04 ethanwang-04 commented Oct 27, 2023

Implemented document file upload, parsing, text extraction, and vector upload to Pinecone with metadata. Also implemented vector retrieval from Pinecone. @trangiabach

@gitguardian
Copy link

gitguardian bot commented Oct 27, 2023

⚠️ GitGuardian has uncovered 4 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
8604658 Triggered Generic High Entropy Secret 8f873f3 backend/ohq/vector_db.py View secret
8604658 Triggered Generic High Entropy Secret 8f873f3 backend/ohq/vector_db.py View secret
8604658 Triggered Generic High Entropy Secret 82b8c3e backend/ohq/vector_db.py View secret
8604658 Triggered Generic High Entropy Secret 82b8c3e backend/ohq/vector_db.py View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secrets safely. Learn here the best practices.
  3. Revoke and rotate these secrets.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

Our GitHub checks need improvements? Share your feedbacks!

@trangiabach trangiabach self-requested a review October 28, 2023 16:51
openai.api_key = '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'
encoding = tiktoken.get_encoding("cl100k_base")

pinecone.init(api_key='2a81b230-ba5d-4290-a3ac-b60e6ad17a66', environment='gcp-starter')
Copy link
Contributor

@trangiabach trangiabach Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should put all the API keys in the .env file and get the secrets back via os.environ.get(API_KEY). Strongly recommend against committing keys without using .env and without adding .env into .gitignore when we are committing into a public repo!

Copy link
Contributor

@trangiabach trangiabach Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a cleaner way to do this would be to also set up the keys inside the django.conf object. You can look at the sms.py to see how it was setup but essential, the steps are:

  1. Go to officehoursqueue/settings/base.py and define your environment secret keys there (follow the TWILIO example and set up the keys there)and it would look like this PINECONE_API_KEY=os.environ.get("PINECONE_API_KEY")
  2. Now you can just import the global settings object like this: from django.conf import settings anywhere within the codebase or look at ohq/sms.py.
  3. Get the secret keys by calling settings.PINECONE_API_KEY and load it as needed

@@ -9,6 +9,8 @@ frontend/yarn-error.log
__pycache__/
*.pyc

.env
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@@ -411,3 +411,24 @@ class Announcement(models.Model):
author = models.ForeignKey(User, related_name="announcements", on_delete=models.CASCADE)
time_updated = models.DateTimeField(auto_now=True)
course = models.ForeignKey(Course, related_name="announcements", on_delete=models.CASCADE)


class VectorDB(models.Model):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also add a __str__ method to this model to format the name of the VectorDB objects within django admin. So perhaps:

def __str__(self):
      return f"{self.course}: {self.name} Vector DB"

return False

if view.action == "retrieve":
return membership.is_ta
Copy link
Contributor

@trangiabach trangiabach Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would only mean that TAs roles can have permission to retrieve but not Leadership roles (like Head TA or Professor), so should be:

return membership.is_ta or membership.is_leadership

to have all course staff have access

return False

if view.action == "retrieve":
return membership.is_ta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as in the DocumentCreatePermission class! Include head tas and professor in retrieve permisson!

return membership.is_ta

# Head TAs+ can make changes
if view.action in ["retrieve", "create", "destroy", "update", "partial_update"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as in the DocumentCreatePermission class! Remove retrieve if retrieve permission is already defined above!

backend/ohq/urls.py Outdated Show resolved Hide resolved
@trangiabach
Copy link
Contributor

Implemented document file upload, parsing, text extraction, and vector upload to Pinecone with metadata. Also implemented vector retrieval from Pinecone. @trangiabach

@ethanwang-04 Overall good job on this PR (and cool stuff) and I left a few comments here and there. I would leave over some more comments as I see more possible fixes!

def search_index(term, top_k=5):
embedded_term = embed_vectors(term)

pinecone.init(api_key='2a81b230-ba5d-4290-a3ac-b60e6ad17a66', environment='gcp-starter')
Copy link
Contributor

@trangiabach trangiabach Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as the one about secret keys. Use the global django.conf settings object

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to remove this

def search_index_with_metadata(term, metadata, top_k=5):
embedded_term = embed_vectors(term)

pinecone.init(api_key='2a81b230-ba5d-4290-a3ac-b60e6ad17a66', environment='gcp-starter')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as the one about secret keys. Use the global django.conf settings object

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this


results = []
for match in matches:
print(match)
Copy link
Contributor

@trangiabach trangiabach Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to remove print statements from production code!

results = []
for match in matches:
print(match)
print("this is not metadata query")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember to remove print statements from production code!

…r upload to Pinecone with metadata. Also implemented vector retrieval from Pinecone.

class DocumentCreateView(generics.CreateAPIView):

# permission_classes = [DocumentCreatePermission | IsSuperuser]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uncomment this line to allow permissions to apply

from ohq.models import Document

openai.api_key = settings.OPENAI_API_KEY
pinecone.init(api_key=settings.PINECONE_API_KEY, environment='gcp-starter')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think u should also include 'gcp-starter' within the global settings object as well

@trangiabach
Copy link
Contributor

Implemented document file upload, parsing, text extraction, and vector upload to Pinecone with metadata. Also implemented vector retrieval from Pinecone. @trangiabach

Implemented document file upload, parsing, text extraction, and vector upload to Pinecone with metadata. Also implemented vector retrieval from Pinecone. @trangiabach

Implemented document file upload, parsing, text extraction, and vector upload to Pinecone with metadata. Also implemented vector retrieval from Pinecone. @trangiabach

Implemented document file upload, parsing, text extraction, and vector upload to Pinecone with metadata. Also implemented vector retrieval from Pinecone. @trangiabach

Encouraged writing some tests to make sure vector databases models works correctly, especially with parsing different kinds of documents into chunks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants