Adds data source properties to git connectors #1280

Merged
merged 45 commits into from
Oct 10, 2023
Changes from 21 commits
a3fdca5
Adds data source properties to git connectors
rvztz Sep 1, 2023
d8580d2
Merge branch 'main' into data-source-properties-git
rvztz Sep 7, 2023
128b90a
Sets file_metadata as a functools.cached_property
rvztz Sep 7, 2023
5104b26
Sets Gitlab version to last_commit_id
rvztz Sep 8, 2023
906b5f1
Removes debugging logger
rvztz Sep 8, 2023
d7c36f9
Adds data source properties to git connectors <- Ingest test fixtures…
ryannikolaidis Sep 8, 2023
b62dae8
Merge branch 'main' into data-source-properties-git
rvztz Sep 9, 2023
6dafd05
Changelog bump
rvztz Sep 9, 2023
b99b644
Removes updates fixtures to re-process
rvztz Sep 9, 2023
e407706
Adds data source properties to git connectors <- Ingest test fixtures…
ryannikolaidis Sep 9, 2023
63315e6
Merge branch 'main' into data-source-properties-git
rvztz Sep 11, 2023
c32f51a
Solves merge issues
rvztz Sep 11, 2023
4cd960e
Merge branch 'main' into data-source-properties-git
rvztz Sep 12, 2023
0412dab
Adds `update_source_metadata` method to git-based connectors
rvztz Sep 14, 2023
57893b1
Merge branch 'main' into data-source-properties-git
rvztz Sep 14, 2023
1f2f0f1
sets default `source_metadata`
rvztz Sep 14, 2023
54e2381
linting
rvztz Sep 14, 2023
6086055
Merge branch 'main' into data-source-properties-git
rvztz Sep 15, 2023
56675af
Removes redundant exceptions
rvztz Sep 15, 2023
a4a44ea
Merge branch 'main' into data-source-properties-git
rvztz Sep 15, 2023
53b2080
decouples logic between fetching repo file and actual content
rvztz Sep 15, 2023
9ae8515
Update CHANGELOG.md
ryannikolaidis Sep 15, 2023
55710b6
Merge branch 'main' into data-source-properties-git
ryannikolaidis Sep 20, 2023
ffb7feb
Merge branch 'main' into data-source-properties-git
rvztz Sep 20, 2023
1830638
Merge branch 'data-source-properties-git' of github.com:Unstructured-…
rvztz Sep 20, 2023
d89900c
Merge branch 'main' into data-source-properties-git
rvztz Sep 20, 2023
7d653a0
Merge branch 'main' into data-source-properties-git
rvztz Sep 20, 2023
6c4945d
Version bump
rvztz Sep 20, 2023
06d0396
Removes url property from `record_locator`. Adds branch property to `…
rvztz Sep 21, 2023
4046385
Adds data source properties to git connectors <- Ingest test fixtures…
ryannikolaidis Sep 21, 2023
2afc0a7
Merge branch 'main' into data-source-properties-git
rvztz Sep 27, 2023
4fe653e
Avoids setting `branch` on `record_locator` if its value is None
rvztz Sep 27, 2023
ff02748
Adds data source properties to git connectors <- Ingest test fixtures…
ryannikolaidis Sep 27, 2023
485731c
Merge branch 'main' into data-source-properties-git
rvztz Sep 28, 2023
71538f3
Merge branch 'main' into data-source-properties-git
rvztz Sep 28, 2023
065dacb
Merge branch 'main' into data-source-properties-git
rvztz Sep 29, 2023
3ce83ef
Merge branch 'main' into data-source-properties-git
rvztz Sep 29, 2023
211cf2a
Merge branch 'data-source-properties-git' of github.com:Unstructured-…
rvztz Sep 29, 2023
a76a140
Adds previously removed expected-structured-output for notion
rvztz Sep 29, 2023
04ad11f
removes additional spaces
rvztz Sep 29, 2023
0c20f13
Merge branch 'main' into data-source-properties-git
rvztz Sep 29, 2023
8e38bf7
Merge branch 'main' into data-source-properties-git
rvztz Sep 29, 2023
243dfeb
Merge branch 'main' into data-source-properties-git
rvztz Oct 3, 2023
fa77878
version bump
rvztz Oct 3, 2023
efd704c
Merge branch 'main' into data-source-properties-git
rvztz Oct 10, 2023
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,4 +1,4 @@
## 0.10.15-dev14
## 0.10.15-dev15

### Enhancements

@@ -12,6 +12,7 @@
* **Better debug output related to sentence counting internals**. Clarify message when sentence is not counted toward sentence count because there aren't enough words, relevant for developers focused on `unstructured`'s NLP internals.
* **Faster ocr_only speed for partitioning PDF and images.** Use `unstructured_pytesseract.run_and_get_multiple_output` function to reduce the number of calls to `tesseract` by half when partitioning pdf or image with `tesseract`
* **Adds data source properties to fsspec connectors.** These properties (date_created, date_modified, version, source_url, record_locator) are written to element metadata during ingest, mapping elements to information about the document source from which they derive.
* **Adds data source properties (date_created, date_modified, version, exists, source_url, record_locator) to the git base interface.** Implements `update_source_metadata` method in git-based connectors.

### Features

@@ -3,7 +3,16 @@
"type": "Title",
"element_id": "e3e5334b595ef9b648bf7f1f6c1a60c4",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/LICENSE.txt",
"version": "W/\"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/plain"
},
"text": "Downloadify: Client Side File Creation JavaScript + Flash Library"
@@ -12,7 +21,16 @@
"type": "Title",
"element_id": "8dc8800e5660b2558bb7f5f5416ca498",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/LICENSE.txt",
"version": "W/\"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/plain"
},
"text": "Copyright (c) 2009 Douglas C. Neiner"
@@ -21,7 +39,16 @@
"type": "NarrativeText",
"element_id": "fa3ff462f020dcadaf3c44b61f0df757",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/LICENSE.txt",
"version": "W/\"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/plain"
},
"text": "Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:"
@@ -30,7 +57,16 @@
"type": "NarrativeText",
"element_id": "70760316a66259dc346c891a2b964556",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/LICENSE.txt",
"version": "W/\"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/plain"
},
"text": "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software."
@@ -39,7 +75,16 @@
"type": "NarrativeText",
"element_id": "1da9072633b5e4291608b205a664d5af",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/LICENSE.txt",
"version": "W/\"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "LICENSE.txt"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/plain"
},
"text": "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE."
@@ -3,7 +3,16 @@
"type": "Title",
"element_id": "56a9f768a0968be676f9addd5ec3032e",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -13,7 +22,16 @@
"type": "Title",
"element_id": "d551bbfc9477547e4dce6264d8196c7b",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1,
"link_urls": [
@@ -29,7 +47,16 @@
"type": "Title",
"element_id": "971b974235a86ca628dcc713d6e2e8d9",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -39,7 +66,16 @@
"type": "Title",
"element_id": "4112a488690bdbc1d39d5b78068eae9f",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -49,7 +85,16 @@
"type": "NarrativeText",
"element_id": "f89c9cf63bd2e72f560ee043d942a1e7",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -59,7 +104,16 @@
"type": "NarrativeText",
"element_id": "53a4db70c6d40ed5206711ed8a255e03",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -69,7 +123,16 @@
"type": "Title",
"element_id": "839973fba0c850f1729fad098b031203",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
@@ -79,7 +142,16 @@
"type": "NarrativeText",
"element_id": "b7db0dffb05f01f3f13d34420b82c261",
"metadata": {
"data_source": {},
"data_source": {
"url": "https://raw.githubusercontent.com/dcneiner/Downloadify/master/test.html",
"version": "W/\"c63c8fc21d46d44de85a14a3ed4baec0348ce344\"",
"record_locator": {
"url": "dcneiner/Downloadify",
"repo_path": "dcneiner/Downloadify",
"file_path": "test.html"
},
"date_modified": "2010-01-23T23:18:40"
},
"filetype": "text/html",
"page_number": 1
},
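For anyone consuming these fixtures downstream, here is a minimal sketch (not part of this PR) of grouping output elements by source file using the new `record_locator` block. The two populated elements are abbreviated from the LICENSE.txt fixture above; the third element with an empty `data_source` and the helper name `group_by_source` are hypothetical, added only for illustration.

```python
from collections import defaultdict

# First two entries are abbreviated from the LICENSE.txt fixture in this PR.
# The last entry is a hypothetical element with no source info (pre-change shape).
ELEMENTS = [
    {
        "type": "Title",
        "text": "Downloadify: Client Side File Creation JavaScript + Flash Library",
        "metadata": {
            "data_source": {
                "version": 'W/"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4"',
                "record_locator": {
                    "url": "dcneiner/Downloadify",
                    "repo_path": "dcneiner/Downloadify",
                    "file_path": "LICENSE.txt",
                },
                "date_modified": "2010-01-23T23:18:40",
            },
        },
    },
    {
        "type": "Title",
        "text": "Copyright (c) 2009 Douglas C. Neiner",
        "metadata": {
            "data_source": {
                "version": 'W/"2c4f1ab8689a6dfef4ee7d13d4d935cb6663a7e4"',
                "record_locator": {
                    "url": "dcneiner/Downloadify",
                    "repo_path": "dcneiner/Downloadify",
                    "file_path": "LICENSE.txt",
                },
                "date_modified": "2010-01-23T23:18:40",
            },
        },
    },
    {"type": "Title", "text": "placeholder", "metadata": {"data_source": {}}},
]


def group_by_source(elements):
    """Map (repo_path, file_path) to the list of element texts from that file."""
    grouped = defaultdict(list)
    for element in elements:
        locator = element["metadata"].get("data_source", {}).get("record_locator")
        if not locator:
            continue  # no source information recorded for this element
        key = (locator["repo_path"], locator["file_path"])
        grouped[key].append(element["text"])
    return dict(grouped)


grouped = group_by_source(ELEMENTS)
```

Elements with an empty `data_source` are skipped rather than raising, which keeps the helper usable on output produced before this change.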
2 changes: 1 addition & 1 deletion unstructured/__version__.py
@@ -1 +1 @@
__version__ = "0.10.15-dev14" # pragma: no cover
__version__ = "0.10.15-dev15" # pragma: no cover
14 changes: 14 additions & 0 deletions unstructured/ingest/connector/git.py
@@ -37,10 +37,21 @@ def filename(self):
    def _output_filename(self):
        return Path(self.partition_config.output_dir) / f"{self.path}.json"

    @property
    def record_locator(self) -> t.Dict[str, t.Any]:
        return {
            "url": self.connector_config.url,
            "repo_path": self.connector_config.repo_path,
            "file_path": self.path,
        }

    def _create_full_tmp_dir_path(self):
        """includes directories in the gitlab repository"""
        self.filename.parent.mkdir(parents=True, exist_ok=True)

    def update_source_metadata(self, **kwargs):
        raise NotImplementedError()

    @SourceConnectionError.wrap
    @BaseIngestDoc.skip_if_file_exists
    def get_file(self):
@@ -49,6 +60,9 @@ def get_file(self):
        logger.debug(f"Fetching {self} - PID: {os.getpid()}")
        self._fetch_and_write()

    def _fetch_content(self) -> None:
        raise NotImplementedError()

    def _fetch_and_write(self) -> None:
        raise NotImplementedError()

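The base-class contract in the git.py hunk above (a shared `record_locator`, with `update_source_metadata` left to each provider) can be illustrated with a small self-contained sketch. Every class name here is a stand-in invented for illustration; it does not import or reproduce the real `unstructured` interfaces, and the dict-backed subclass is hypothetical.

```python
import typing as t
from dataclasses import dataclass, field


@dataclass
class SourceMetadataSketch:
    """Stand-in holding the metadata fields this PR populates."""

    date_created: t.Optional[str] = None
    date_modified: t.Optional[str] = None
    version: t.Optional[str] = None
    source_url: t.Optional[str] = None
    exists: t.Optional[bool] = None


@dataclass
class GitDocSketch:
    """Stand-in for the git base doc: shared locator, provider-specific metadata."""

    url: str
    repo_path: str
    path: str
    source_metadata: t.Optional[SourceMetadataSketch] = None

    @property
    def record_locator(self) -> t.Dict[str, t.Any]:
        return {"url": self.url, "repo_path": self.repo_path, "file_path": self.path}

    def update_source_metadata(self, **kwargs) -> None:
        raise NotImplementedError()


@dataclass
class InMemoryDocSketch(GitDocSketch):
    """Hypothetical provider backed by a dict instead of a real git host."""

    files: t.Dict[str, t.Dict[str, str]] = field(default_factory=dict)

    def update_source_metadata(self, **kwargs) -> None:
        info = self.files.get(self.path)
        if info is None:
            # Mirror the PR's behaviour for missing files: record exists=False.
            self.source_metadata = SourceMetadataSketch(exists=False)
            return
        self.source_metadata = SourceMetadataSketch(
            date_modified=info["date_modified"],
            version=info["version"],
            exists=True,
        )


doc = InMemoryDocSketch(
    url="dcneiner/Downloadify",
    repo_path="dcneiner/Downloadify",
    path="LICENSE.txt",
    files={"LICENSE.txt": {"date_modified": "2010-01-23T23:18:40", "version": "abc"}},
)
doc.update_source_metadata()
```

Keeping `record_locator` in the base class means each provider only has to say how it looks up dates and versions, which is the split the diff appears to aim for.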
52 changes: 50 additions & 2 deletions unstructured/ingest/connector/github.py
@@ -1,5 +1,6 @@
import typing as t
from dataclasses import dataclass
from datetime import datetime
from urllib.parse import urlparse

import requests
@@ -10,6 +11,7 @@
    SimpleGitConfig,
)
from unstructured.ingest.error import SourceConnectionError
from unstructured.ingest.interfaces import SourceMetadata
from unstructured.ingest.logger import logger
from unstructured.utils import requires_dependencies

@@ -52,23 +54,69 @@ class GitHubIngestDoc(GitIngestDoc):
    connector_config: SimpleGitHubConfig
    registry_name: str = "github"

    def _fetch_and_write(self) -> None:
        content_file = self.connector_config.get_repo().get_contents(self.path)
    @property
    def date_created(self) -> t.Optional[str]:
        return None

    @requires_dependencies(["github"], extras="github")
    def _fetch_file(self):
        from github.GithubException import UnknownObjectException

        try:
            content_file = self.connector_config.get_repo().get_contents(self.path)
        except UnknownObjectException:
            logger.error(f"File doesn't exist {self.connector_config.url}/{self.path}")
            return None

        return content_file

    def _fetch_content(self, content_file):
        contents = b""
        if (
            not content_file.content  # type: ignore
            and content_file.encoding == "none"  # type: ignore
            and content_file.size  # type: ignore
        ):
            logger.info("File too large for the GitHub API, using direct download link instead.")
            # NOTE: Maybe add a raise_for_status to catch connection timeout or HTTP Errors?
            response = requests.get(content_file.download_url)  # type: ignore
            if response.status_code != 200:
                logger.info("Direct download link has failed... Skipping this file.")
                return None
            else:
                contents = response.content
        else:
            contents = content_file.decoded_content  # type: ignore
        return contents

    def update_source_metadata(self, **kwargs):
        # Fetch only when no content_file was passed in; kwargs.get(..., default)
        # would evaluate self._fetch_file() eagerly and repeat the API request.
        content_file = (
            kwargs["content_file"] if "content_file" in kwargs else self._fetch_file()
        )
        if content_file is None:
            self.source_metadata = SourceMetadata(
                exists=False,
            )
            return

        date_modified = datetime.strptime(
            content_file.last_modified,
            "%a, %d %b %Y %H:%M:%S %Z",
        ).isoformat()
        self.source_metadata = SourceMetadata(
            date_modified=date_modified,
            version=content_file.etag,
            source_url=content_file.download_url,
            exists=True,
        )

    def _fetch_and_write(self) -> None:
        content_file = self._fetch_file()
        self.update_source_metadata(content_file=content_file)
        # A missing file yields content_file=None; skip the content fetch so the
        # ValueError below is raised instead of an AttributeError on None.
        contents = None if content_file is None else self._fetch_content(content_file)
        if contents is None:
            raise ValueError(
                f"Failed to retrieve file from repo "
                f"{self.connector_config.url}/{self.path}. Check logs",
            )
        with open(self.filename, "wb") as f:
            f.write(contents)

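The `date_modified` conversion in `update_source_metadata` above can be checked in isolation. The sketch below reuses the format string from the diff; the input header value is an assumption, chosen so that the result matches the `date_modified` seen in the fixtures, and the function name `http_date_to_iso` is hypothetical.

```python
from datetime import datetime

# Same format string as in the github.py diff above.
HTTP_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S %Z"


def http_date_to_iso(last_modified: str) -> str:
    """Convert an HTTP-style date string into an ISO 8601 string.

    %Z consumes the zone name but does not attach tzinfo, so the result is a
    naive timestamp with no UTC offset.
    """
    return datetime.strptime(last_modified, HTTP_DATE_FORMAT).isoformat()


# Assumed header value; the real Last-Modified header is not shown in the PR.
print(http_date_to_iso("Sat, 23 Jan 2010 23:18:40 GMT"))  # → 2010-01-23T23:18:40
```

Because the zone name is dropped, consumers comparing `date_modified` across connectors may want to treat these values as UTC explicitly.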