Adds data source properties to git connectors #1280

Merged
merged 45 commits into from
Oct 10, 2023
Changes from 44 commits
a3fdca5
Adds data source properties to git connectors
rvztz Sep 1, 2023
d8580d2
Merge branch 'main' into data-source-properties-git
rvztz Sep 7, 2023
128b90a
Sets file_metadata as a functools.cached_property
rvztz Sep 7, 2023
5104b26
Sets Gitlab version to last_commit_id
rvztz Sep 8, 2023
906b5f1
Removes debugging logger
rvztz Sep 8, 2023
d7c36f9
Adds data source properties to git connectors <- Ingest test fixtures…
ryannikolaidis Sep 8, 2023
b62dae8
Merge branch 'main' into data-source-properties-git
rvztz Sep 9, 2023
6dafd05
Changelog bump
rvztz Sep 9, 2023
b99b644
Removes updates fixtures to re-process
rvztz Sep 9, 2023
e407706
Adds data source properties to git connectors <- Ingest test fixtures…
ryannikolaidis Sep 9, 2023
63315e6
Merge branch 'main' into data-source-properties-git
rvztz Sep 11, 2023
c32f51a
Solves merge issues
rvztz Sep 11, 2023
4cd960e
Merge branch 'main' into data-source-properties-git
rvztz Sep 12, 2023
0412dab
Adds `update_source_metadata` method to git-based connectors
rvztz Sep 14, 2023
57893b1
Merge branch 'main' into data-source-properties-git
rvztz Sep 14, 2023
1f2f0f1
sets default `source_metadata`
rvztz Sep 14, 2023
54e2381
linting
rvztz Sep 14, 2023
6086055
Merge branch 'main' into data-source-properties-git
rvztz Sep 15, 2023
56675af
Removes redundant exceptions
rvztz Sep 15, 2023
a4a44ea
Merge branch 'main' into data-source-properties-git
rvztz Sep 15, 2023
53b2080
decouples logic between fetching repo file and actual content
rvztz Sep 15, 2023
9ae8515
Update CHANGELOG.md
ryannikolaidis Sep 15, 2023
55710b6
Merge branch 'main' into data-source-properties-git
ryannikolaidis Sep 20, 2023
ffb7feb
Merge branch 'main' into data-source-properties-git
rvztz Sep 20, 2023
1830638
Merge branch 'data-source-properties-git' of github.com:Unstructured-…
rvztz Sep 20, 2023
d89900c
Merge branch 'main' into data-source-properties-git
rvztz Sep 20, 2023
7d653a0
Merge branch 'main' into data-source-properties-git
rvztz Sep 20, 2023
6c4945d
Version bump
rvztz Sep 20, 2023
06d0396
Removes url property from `record_locator`. Adds branch property to `…
rvztz Sep 21, 2023
4046385
Adds data source properties to git connectors <- Ingest test fixtures…
ryannikolaidis Sep 21, 2023
2afc0a7
Merge branch 'main' into data-source-properties-git
rvztz Sep 27, 2023
4fe653e
Avoids setting `branch` on `record_locator`if its value is None
rvztz Sep 27, 2023
ff02748
Adds data source properties to git connectors <- Ingest test fixtures…
ryannikolaidis Sep 27, 2023
485731c
Merge branch 'main' into data-source-properties-git
rvztz Sep 28, 2023
71538f3
Merge branch 'main' into data-source-properties-git
rvztz Sep 28, 2023
065dacb
Merge branch 'main' into data-source-properties-git
rvztz Sep 29, 2023
3ce83ef
Merge branch 'main' into data-source-properties-git
rvztz Sep 29, 2023
211cf2a
Merge branch 'data-source-properties-git' of github.com:Unstructured-…
rvztz Sep 29, 2023
a76a140
Adds previously removed expected-structured-output for notion
rvztz Sep 29, 2023
04ad11f
removes additional spaces
rvztz Sep 29, 2023
0c20f13
Merge branch 'main' into data-source-properties-git
rvztz Sep 29, 2023
8e38bf7
Merge branch 'main' into data-source-properties-git
rvztz Sep 29, 2023
243dfeb
Merge branch 'main' into data-source-properties-git
rvztz Oct 3, 2023
fa77878
version bump
rvztz Oct 3, 2023
efd704c
Merge branch 'main' into data-source-properties-git
rvztz Oct 10, 2023
6 changes: 3 additions & 3 deletions CHANGELOG.md
@@ -1,11 +1,11 @@
## 0.10.19-dev9
## 0.10.19-dev10

### Enhancements

* **bump `unstructured-inference` to `0.6.6`** The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable to fine tune table extraction performance; it also improves element detection by adding a deduplication post processing step in the `hi_res` partitioning of pdfs and images.
* **Detect text in HTML Heading Tags as Titles** This will increase the accuracy of hierarchies in HTML documents and provide more accurate element categorization. If text is in an HTML heading tag and is not a list item, address, or narrative text, categorize it as a title.
* **Update python-based docs** Refactor docs to use the actual unstructured code rather than using the subprocess library to run the cli command itself.
* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, and Slack connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Adds data source properties to Github and Gitlab connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Adds Table support for the `add_chunking_strategy` decorator to partition functions.** In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of length <n> characters. This means partitioned Table results are ready for use in downstream applications without any post processing.
* **Expose endpoint url for s3 connectors** Allowing the endpoint url to be explicitly overridden makes it possible to use any non-AWS data provider that supports the s3 protocol (e.g. MinIO).

@@ -27,7 +27,7 @@

* **Better detection of natural reading order in images and PDF's** The elements returned by partition better reflect natural reading order in some cases, particularly in complicated multi-column layouts, leading to better chunking and retrieval for downstream applications. Achieved by improving the `xy-cut` sorting to preprocess bboxes, shrinking all bounding boxes by 90% along x and y axes (still centered around the same center point), which allows projection lines to be drawn where not possible before if layout bboxes overlapped.
* **Improves `partition_xml` to be faster and more memory efficient when partitioning large XML files** The new behavior is to partition iteratively to prevent loading the entire XML tree into memory at once in most use cases.
* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, Slack, and DeltaTable connectors** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Adds data source properties to SharePoint, Outlook, Onedrive, Reddit, Slack, DeltaTable** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive. This functionality enables downstream applications to reveal source document applications, e.g. a link to a GDrive doc, Salesforce record, etc.
* **Add functionality to save embedded images in PDF's separately as images** This allows users to save embedded images in PDF's separately as images, given some directory path. The saved image path is written to the metadata for the Image element. Downstream applications may benefit by providing users with image links from relevant "hits."
* **Azure Cognitive Search destination connector** New Azure Cognitive Search destination connector added to ingest CLI. Users may now use `unstructured-ingest` to write partitioned data from over 20 data sources (so far) to an Azure Cognitive Search index.
* **Improves salesforce partitioning** Partitions Salesforce data as xml instead of text for improved detail and flexibility. Partitions htmlbody instead of textbody for Salesforce emails. Importance: Allows all Salesforce fields to be ingested and gives Salesforce emails more detailed partitioning.
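
As a rough illustration of how a downstream application might use the data source properties described in the changelog, the sketch below turns one element's `data_source` metadata into a short citation. The `format_citation` helper is hypothetical and not part of `unstructured`; the sample record is copied from the GitHub test fixture in this PR.

```python
# Sample element, copied from the GitHub ingest fixture in this PR.
element = {
    "type": "Title",
    "element_id": "e3e5334b595ef9b648bf7f1f6c1a60c4",
    "metadata": {
        "data_source": {
            "url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/LICENSE.txt",
            "version": 'W/"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4"',
            "record_locator": {
                "repo_path": "dcneiner/Downloadify",
                "file_path": "LICENSE.txt",
            },
            "date_modified": "2010-01-23T23:18:40",
        },
        "filetype": "text/plain",
    },
}


def format_citation(element: dict) -> str:
    """Hypothetical helper: build a human-readable pointer to the source file."""
    source = element.get("metadata", {}).get("data_source", {})
    locator = source.get("record_locator", {})
    parts = [locator.get("repo_path"), locator.get("file_path")]
    # Fall back to a placeholder when the element has no record locator.
    citation = "/".join(p for p in parts if p) or "unknown source"
    if source.get("date_modified"):
        citation += f" (modified {source['date_modified']})"
    if source.get("url"):
        citation += f" <{source['url']}>"
    return citation


print(format_citation(element))
```

Every lookup uses `.get`, so elements produced by connectors that leave `data_source` empty degrade to a placeholder instead of raising.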
@@ -3,7 +3,15 @@
"type": "Title",
"element_id": "e3e5334b595ef9b648bf7f1f6c1a60c4",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/LICENSE.txt",
"version": "W/\"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/plain",
"languages": [
"eng"
@@ -15,7 +23,15 @@
"type": "Title",
"element_id": "8dc8800e5660b2558bb7f5f5416ca498",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/LICENSE.txt",
"version": "W/\"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/plain",
"languages": [
"eng"
@@ -27,7 +43,15 @@
"type": "NarrativeText",
"element_id": "fa3ff462f020dcadaf3c44b61f0df757",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/LICENSE.txt",
"version": "W/\"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/plain",
"languages": [
"eng"
@@ -39,7 +63,15 @@
"type": "NarrativeText",
"element_id": "70760316a66259dc346c891a2b964556",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/LICENSE.txt",
"version": "W/\"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/plain",
"languages": [
"eng"
@@ -51,7 +83,15 @@
"type": "NarrativeText",
"element_id": "1da9072633b5e4291608b205a664d5af",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/LICENSE.txt",
"version": "W/\"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/plain",
"languages": [
"eng"
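
The `version` values in this fixture have the shape of weak HTTP entity tags (`W/"..."`), which fits the GitHub connector storing the fetched file's `etag`. A consumer that only wants the bare identifier could strip the validator syntax along these lines. This is a minimal sketch: `strip_etag` is a hypothetical helper, and treating every `version` as an ETag is an assumption that holds only for connectors that populate it this way.

```python
def strip_etag(version: str) -> str:
    """Return the opaque part of an ETag, dropping the weak prefix and quotes."""
    tag = version.strip()
    # A leading "W/" marks a weak validator; it is not part of the identifier.
    if tag.startswith("W/"):
        tag = tag[2:]
    # The identifier itself is wrapped in double quotes.
    if len(tag) >= 2 and tag[0] == '"' and tag[-1] == '"':
        tag = tag[1:-1]
    return tag


fixture_version = 'W/"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4"'
print(strip_etag(fixture_version))
```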
@@ -3,7 +3,15 @@
"type": "Title",
"element_id": "56a9f768a0968be676f9addd5ec3032e",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -13,7 +21,15 @@
"type": "Title",
"element_id": "d551bbfc9477547e4dce6264d8196c7b",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1,
"link_urls": [
@@ -29,7 +45,15 @@
"type": "Title",
"element_id": "971b974235a86ca628dcc713d6e2e8d9",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -39,7 +63,15 @@
"type": "Title",
"element_id": "4112a488690bdbc1d39d5b78068eae9f",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -49,7 +81,15 @@
"type": "NarrativeText",
"element_id": "f89c9cf63bd2e72f560ee043d942a1e7",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -59,7 +99,15 @@
"type": "NarrativeText",
"element_id": "53a4db70c6d40ed5206711ed8a255e03",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -69,7 +117,15 @@
"type": "Title",
"element_id": "839973fba0c850f1729fad098b031203",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -79,7 +135,15 @@
"type": "NarrativeText",
"element_id": "b7db0dffb05f01f3f13d34420b82c261",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
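
Because every element from the same file carries the same `record_locator` and `version`, one possible use of these fixtures' fields is spotting files that changed between two ingest runs. The sketch below is an assumption about downstream usage, not something this PR implements; `versions_by_file`, `changed_files`, and `_element` are hypothetical names, and the second run's version string is made up for the example.

```python
from typing import Dict, Iterable, List, Tuple

FileKey = Tuple[str, str]  # (repo_path, file_path)


def versions_by_file(elements: Iterable[dict]) -> Dict[FileKey, str]:
    """Map (repo_path, file_path) to the version recorded on its elements."""
    versions: Dict[FileKey, str] = {}
    for element in elements:
        source = element.get("metadata", {}).get("data_source", {})
        locator = source.get("record_locator")
        if not locator or "version" not in source:
            continue  # element carries no usable source information
        versions[(locator["repo_path"], locator["file_path"])] = source["version"]
    return versions


def changed_files(old: Iterable[dict], new: Iterable[dict]) -> List[FileKey]:
    """Files in the new run whose version differs from, or is absent in, the old run."""
    before, after = versions_by_file(old), versions_by_file(new)
    return sorted(key for key, version in after.items() if before.get(key) != version)


def _element(file_path: str, version: str) -> dict:
    """Build a minimal element shaped like the fixtures above."""
    locator = {"repo_path": "dcneiner/Downloadify", "file_path": file_path}
    return {"metadata": {"data_source": {"version": version, "record_locator": locator}}}


run_1 = [_element("test.html", 'W/"c63c8fc21d46d44de85a14a3ed4baec0348ce344"')]
run_2 = [_element("test.html", 'W/"hypothetical-new-version"')]  # made-up value
print(changed_files(run_1, run_2))
```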
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
__version__ = "0.10.19-dev9" # pragma: no cover
__version__ = "0.10.19-dev10" # pragma: no cover
17 changes: 17 additions & 0 deletions unstructured/ingest/connector/git.py
@@ -37,18 +37,35 @@ def filename(self):
    def _output_filename(self):
        return Path(self.partition_config.output_dir) / f"{self.path}.json"

    @property
    def record_locator(self) -> t.Dict[str, t.Any]:
        record_locator = {
            "repo_path": self.connector_config.repo_path,
            "file_path": self.path,
        }
        if self.connector_config.branch is not None:
            record_locator["branch"] = self.connector_config.branch
        return record_locator

    def _create_full_tmp_dir_path(self):
        """includes directories in the gitlab repository"""
        self.filename.parent.mkdir(parents=True, exist_ok=True)

    def update_source_metadata(self, **kwargs):
        raise NotImplementedError()

    @SourceConnectionError.wrap
    @BaseIngestDoc.skip_if_file_exists
    def get_file(self):
        """Fetches the "remote" doc and stores it locally on the filesystem."""
        self._create_full_tmp_dir_path()
        logger.debug(f"Fetching {self} - PID: {os.getpid()}")
        self._fetch_and_write()

    def _fetch_content(self) -> None:
        raise NotImplementedError()

    def _fetch_and_write(self) -> None:
        raise NotImplementedError()
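
To make the `record_locator` behaviour concrete, here is a self-contained sketch that reproduces only that property. `StubGitConfig` and `StubGitDoc` are hypothetical stand-ins for the real config and `GitIngestDoc` classes, reduced to the fields the property reads; the repository values come from the test fixtures.

```python
import typing as t
from dataclasses import dataclass


@dataclass
class StubGitConfig:
    """Stand-in for the connector config; only the fields used below."""

    repo_path: str
    branch: t.Optional[str] = None


@dataclass
class StubGitDoc:
    """Stand-in for GitIngestDoc, reduced to the record_locator logic."""

    connector_config: StubGitConfig
    path: str

    @property
    def record_locator(self) -> t.Dict[str, t.Any]:
        record_locator = {
            "repo_path": self.connector_config.repo_path,
            "file_path": self.path,
        }
        # Omit the key entirely rather than writing "branch": None.
        if self.connector_config.branch is not None:
            record_locator["branch"] = self.connector_config.branch
        return record_locator


default_branch_doc = StubGitDoc(StubGitConfig("dcneiner/Downloadify"), "LICENSE.txt")
pinned_doc = StubGitDoc(StubGitConfig("dcneiner/Downloadify", branch="master"), "LICENSE.txt")
print(default_branch_doc.record_locator)
print(pinned_doc.record_locator)
```

Leaving `branch` out when it is unset keeps the serialized metadata free of null values, which matches the fixtures above, where no `branch` key appears.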

52 changes: 50 additions & 2 deletions unstructured/ingest/connector/github.py
@@ -1,5 +1,6 @@
import typing as t
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlparse

import requests
@@ -10,6 +11,7 @@
    SimpleGitConfig,
)
from unstructured.ingest.error import SourceConnectionError
from unstructured.ingest.interfaces import SourceMetadata
from unstructured.ingest.logger import logger
from unstructured.utils import requires_dependencies

@@ -52,23 +54,69 @@ class GitHubIngestDoc(GitIngestDoc):
    connector_config: SimpleGitHubConfig
    registry_name: str = "github"

    def _fetch_and_write(self) -> None:
        content_file = self.connector_config.get_repo().get_contents(self.path)
    @property
    def date_created(self) -> t.Optional[str]:
        return None

    @requires_dependencies(["github"], extras="github")
    def _fetch_file(self):
        from github.GithubException import UnknownObjectException

        try:
            content_file = self.connector_config.get_repo().get_contents(self.path)
        except UnknownObjectException:
            logger.error(f"File doesn't exist: {self.connector_config.url}/{self.path}")
            return None

        return content_file

    def _fetch_content(self, content_file):
        if content_file is None:
            # _fetch_file already logged the missing file; let the caller raise.
            return None
        contents = b""
        if (
            not content_file.content  # type: ignore
            and content_file.encoding == "none"  # type: ignore
            and content_file.size  # type: ignore
        ):
            logger.info("File too large for the GitHub API, using direct download link instead.")
            # NOTE: Maybe add a raise_for_status to catch connection timeout or HTTP Errors?
            response = requests.get(content_file.download_url)  # type: ignore
            if response.status_code != 200:
                logger.info("Direct download link has failed... Skipping this file.")
                return None
            else:
                contents = response.content
        else:
            contents = content_file.decoded_content  # type: ignore
        return contents

    def update_source_metadata(self, **kwargs):
        # Fetch only when the caller did not supply the file: a default passed to
        # kwargs.get() is evaluated eagerly and would always hit the API.
        if "content_file" in kwargs:
            content_file = kwargs["content_file"]
        else:
            content_file = self._fetch_file()
        if content_file is None:
            self.source_metadata = SourceMetadata(
                exists=False,
            )
            return

        date_modified = datetime.strptime(
            content_file.last_modified,
            "%a, %d %b %Y %H:%M:%S %Z",
        ).isoformat()
        self.source_metadata = SourceMetadata(
            date_modified=date_modified,
            version=content_file.etag,
            source_url=content_file.download_url,
            exists=True,
        )

    def _fetch_and_write(self) -> None:
        content_file = self._fetch_file()
        self.update_source_metadata(content_file=content_file)
        contents = self._fetch_content(content_file)
        if contents is None:
            raise ValueError(
                f"Failed to retrieve file from repo "
                f"{self.connector_config.url}/{self.path}. Check logs",
            )
        with open(self.filename, "wb") as f:
            f.write(contents)

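
The `date_modified` conversion in `update_source_metadata` can be tried in isolation. The sketch below assumes `last_modified` is an HTTP-date string such as the one shown; the sample value is made up to match the `date_modified` timestamp in the test fixtures and is not taken from a real API response. It also shows `email.utils.parsedate_to_datetime` as a possible standard-library alternative, which this PR does not use.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime

# Assumed sample value, chosen to match the fixture timestamp.
last_modified = "Sat, 23 Jan 2010 23:18:40 GMT"

# Same format string as the connector; the result is a naive datetime.
via_strptime = datetime.strptime(last_modified, "%a, %d %b %Y %H:%M:%S %Z")
print(via_strptime.isoformat())

# Alternative parser with built-in English day and month names, so it does not
# depend on the process locale the way %a and %b do.
via_email_utils = parsedate_to_datetime(last_modified)
print(via_email_utils.replace(tzinfo=None) == via_strptime)
```

Note that the `strptime` route discards the timezone, so the stored string carries no UTC offset; consumers have to know by convention that the value is in GMT.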