Run centillion on Heroku #20

Open
charlesreid1 opened this issue Aug 10, 2018 · 1 comment

charlesreid1 commented Aug 10, 2018

Multiple problems stand in the way of running this as a serverless Flask instance, but there are several ways to solve them:

Problems

  1. The search index is stored on disk. Heroku does not provide persistent files on disk, so the index must be stored in memory.

  2. Google Drive .docx documents must be downloaded to disk so that they can be converted to plain text using Pandoc. Again, Heroku does not provide persistent files on disk, so this must be done in memory.

  3. Pandoc is not a Python program, nor is it pip-installable. You can't build arbitrary packages on a Heroku node.

(Container) Solutions

Solve 1, 2, and 3 in one fell swoop by deploying Centillion to Heroku as a Docker container (added advantage: Dockerizing services has proven relatively easy in the past). link with more info: basically, Heroku runs its own container registry, so you build Docker images, test them, push them to the registry, and deploy Heroku nodes that run a container image.

(Can also easily do multi-container applications using docker-compose, as I now have experience building multi-container pods.)

(Container-less) Solutions

Solve 1 using the very well-developed solution of SQLAlchemy + Whoosh to store a search index in memory. This requires creating a database and linking the search index schema to the SQLAlchemy database; see e.g. gyllstromk/Flask-WhooshAlchemy.

Solve 2 without containers by using some advanced piping tricks. Using the URLs for Drive documents, download the .docx file into a pipe, and pass contents of that pipe into pandoc. You can call pandoc on stdin just as you can call it on input files.
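A minimal sketch of the piping idea, not Centillion's actual code. It assumes the Drive URL returns the raw .docx bytes without further authentication and that a `pandoc` binary is on `$PATH`. The function names `pipe_through` and `docx_url_to_text` are hypothetical.

```python
import subprocess
import urllib.request
from typing import Sequence


def pipe_through(cmd: Sequence[str], data: bytes) -> str:
    """Feed `data` to the stdin of `cmd` and return its stdout as text.

    Nothing is written to disk: the bytes travel through a pipe.
    """
    result = subprocess.run(
        list(cmd),
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout.decode("utf-8")


def docx_url_to_text(url: str, pandoc: str = "pandoc") -> str:
    """Download a .docx into memory and convert it to plain text.

    Assumption: `url` serves the .docx bytes directly. The input format
    is named explicitly because stdin has no file extension for pandoc
    to inspect.
    """
    with urllib.request.urlopen(url) as response:
        docx_bytes = response.read()
    return pipe_through([pandoc, "--from", "docx", "--to", "plain"], docx_bytes)
```

Keeping `pipe_through` separate from the download step means the pipe logic can be exercised with any command that reads stdin, without needing pandoc or network access.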

Solve 3 without containers by installing the Heroku pandoc buildpack into the project. This is the equivalent of running apt-get install pandoc on your Heroku node.

After installing the pandoc buildpack, pandoc is at /app/vendor/pandoc/bin, so you would probably call that binary with subprocess.Popen(). Alternatively, use pypandoc (this would work because pandoc is added to $PATH when the pandoc buildpack is installed, and that's how pypandoc finds a version of pandoc to wrap).
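One way to locate the binary before calling it, sketched under assumptions: the helper `find_binary` is hypothetical, and it assumes the executable inside /app/vendor/pandoc/bin is named `pandoc`. It checks `$PATH` first and falls back to the buildpack directory.

```python
import os
import shutil
from typing import Optional

# Directory named in the comment above; assumed to contain the executable.
BUILDPACK_BIN_DIR = "/app/vendor/pandoc/bin"


def find_binary(name: str = "pandoc",
                fallback_dir: str = BUILDPACK_BIN_DIR) -> Optional[str]:
    """Return a path to `name`, or None if it cannot be found.

    Looks on $PATH first (which is where the buildpack is said to put
    pandoc), then in `fallback_dir`.
    """
    on_path = shutil.which(name)
    if on_path:
        return on_path
    candidate = os.path.join(fallback_dir, name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None
```

The returned path can be used as the first element of the argument list passed to subprocess.Popen(); a `None` result signals that the buildpack is not installed.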

charlesreid1 commented

Path forward: do everything.

a) Dockerize it

also

b) implement sqlalchemy + pandoc buildpack + pypandoc + tricky subprocess pipes

also

c) provide service scripts/cron jobs for running as a native Unix service
