Add subsection for docs; prioritize getting started with container (#962)
six5532one authored Jul 22, 2023
1 parent 35e529f commit 050cfaf
Showing 1 changed file with 59 additions and 53 deletions.
112 changes: 59 additions & 53 deletions README.md
@@ -63,10 +63,65 @@ Checkout the [`unstructured-api` repo](https://github.com/Unstructured-IO/unstru
to get started making API calls.
You’ll also find instructions there about how to host your own version of the API.

## :bookmark: Documentation
This README gives an overview of how to install, use and develop the library.
For more comprehensive documentation, visit https://unstructured-io.github.io/unstructured/ .

## :eight_pointed_black_star: Quick Start

There are two ways to use the library: 1) run it in a container or 2) install it.

### Using the library in a container

The following instructions are intended to help you get up and running using Docker to interact with `unstructured`.
See [here](https://docs.docker.com/get-docker/) if you don't already have Docker installed on your machine.

NOTE: We build multi-platform images to support both x86_64 and Apple silicon hardware. `docker pull` should download the image that matches your architecture, but you can specify one explicitly with `--platform` (e.g. `--platform linux/amd64`) if needed.

We build Docker images for all pushes to `main`. We tag each image with the corresponding short commit hash (e.g. `fbc7a69`) and the application version (e.g. `0.5.5-dev1`). We also tag the most recent image with `latest`. To use one of these images, `docker pull` it from our image repository.

```bash
docker pull quay.io/unstructured-io/unstructured:latest
```
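
If you script the pull, the tag and `--platform` choices described above can be assembled programmatically. The following is a minimal sketch, not part of the project: the helper names `docker_platform` and `pull_command` and the architecture mapping are assumptions made for illustration.

```python
import platform

# Assumed mapping from platform.machine() values to Docker platform strings.
# Only the two architectures mentioned above are covered.
_ARCH_TO_PLATFORM = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "arm64": "linux/arm64",
    "aarch64": "linux/arm64",
}

IMAGE = "quay.io/unstructured-io/unstructured"


def docker_platform(machine=None):
    """Return the Docker platform string for a machine architecture (hypothetical helper)."""
    machine = (machine or platform.machine()).lower()
    if machine not in _ARCH_TO_PLATFORM:
        raise ValueError(f"no known Docker platform for architecture {machine!r}")
    return _ARCH_TO_PLATFORM[machine]


def pull_command(tag="latest", machine=None):
    """Build the `docker pull` argument list for a given image tag (hypothetical helper)."""
    return ["docker", "pull", "--platform", docker_platform(machine), f"{IMAGE}:{tag}"]


# Example: pin a specific commit-hash tag instead of `latest`.
# pull_command("fbc7a69", machine="x86_64")
```

The resulting list can be passed to `subprocess.run` as-is. Pinning a commit-hash or version tag rather than `latest` keeps repeated pulls reproducible.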

Once the image is pulled, you can create a container from it and open a shell inside it.

```bash
# create the container
docker run -dt --name unstructured quay.io/unstructured-io/unstructured:latest

# this will drop you into a bash shell inside the running container
docker exec -it unstructured bash
```

You can also build your own Docker image.

If you only plan on parsing one type of data, you can speed up building the image by commenting out some
of the packages/requirements necessary for other data types. See the `Dockerfile` to find out which lines are necessary
for your use case.

```bash
make docker-build

# this will drop you into a bash shell inside the running container
make docker-start-bash
```

Once in the running container, you can try things out directly in the Python interpreter's interactive mode.
```bash
# this will drop you into a Python console so you can run the partition functions below
python3

>>> from unstructured.partition.pdf import partition_pdf
>>> elements = partition_pdf(filename="example-docs/layout-parser-paper-fast.pdf")

>>> from unstructured.partition.text import partition_text
>>> elements = partition_text(filename="example-docs/fake-text.txt")
```
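
After a partition call returns, a quick way to inspect the result is to count the elements by type and print the first few. The sketch below assumes only that `elements` is a list of Python objects whose `str()` gives their text; the helper names `summarize` and `preview` are hypothetical and not part of `unstructured`.

```python
from collections import Counter


def summarize(elements):
    """Count elements by their class name (hypothetical helper)."""
    return Counter(type(el).__name__ for el in elements)


def preview(elements, limit=5, width=80):
    """Return the text of the first `limit` elements, each cut to `width` characters."""
    return [str(el)[:width] for el in list(elements)[:limit]]


# In the Python console inside the container, after one of the calls above:
#   summarize(elements)
#   print("\n".join(preview(elements)))
```

Because the helpers rely only on class names and `str()`, they work the same way on the output of either partition function.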

### Installing the library
Use the following instructions to get up and running with `unstructured` and test your
installation.

- Install the Python SDK with `pip install "unstructured[local-inference]"`
- If you do not need to process PDFs or images, you can run `pip install unstructured`
@@ -76,10 +131,10 @@ installation. NOTE: We do not currently support python 3.11, please use an older
- `poppler-utils` (images and PDFs)
- `tesseract-ocr` (images and PDFs)
- `libreoffice` (MS Office docs)
- `pandoc` (EPUBs, RTFs and Open Office docs)

- If you are parsing PDFs and want to use a model from the [layoutparser model
zoo](https://github.com/Unstructured-IO/unstructured-inference#using-models-from-the-layoutparser-model-zoo),
use the instructions [here](https://github.com/Unstructured-IO/unstructured-inference#detectron2).
- For suggestions on how to install on Windows and to learn about dependencies for other features, see the
installation documentation [here](https://unstructured-io.github.io/unstructured/installing.html).
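
Before running the library, it can help to confirm that the system dependencies listed above are actually on your `PATH`. This is a minimal sketch using only the standard library. The mapping from package name to executable name is an assumption (names can differ by platform or distribution), and `find_missing` is a hypothetical helper.

```python
import shutil

# Assumed mapping from each package listed above to an executable it
# typically installs; adjust for your platform if the names differ.
TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def find_missing(tools=TOOLS):
    """Return the package names whose executable is not found on PATH."""
    return sorted(pkg for pkg, exe in tools.items() if shutil.which(exe) is None)


if __name__ == "__main__":
    missing = find_missing()
    print("missing:", ", ".join(missing) if missing else "none")
```

A package reported as missing only matters if you plan to parse the document types it supports.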

At this point, you should be able to run the following code:

@@ -122,55 +177,6 @@ about the library.



## :coffee: Installation Instructions for Local Development

The following instructions are intended to help you get up and running with `unstructured`