Commit

Refresh metastore and Lambda searcher (#4985)
* Add metastore refresh poll

* Minor usability improvements

* Add configuration docs

* Fix polling interval formatting

* Change poetry to pipenv

* Add type stubs for mypy

* Force approval when deploying cdk

* Fix module imports

* Try installing module with setuptools

* Apply review comments
rdettai authored May 16, 2024
1 parent 7f8f938 commit 0e4ae0b
Showing 18 changed files with 577 additions and 1,160 deletions.
12 changes: 9 additions & 3 deletions .github/workflows/publish_lambda_packages.yml
@@ -16,9 +16,15 @@ jobs:
- name: Install rustup
run: curl https://sh.rustup.rs -sSf | sh -s -- --default-toolchain none -y
- name: Install python dependencies
run: pip install ./distribution/lambda
- name: Mypy lint
run: mypy distribution/lambda/
run: |
pip install --user pipenv
pipenv install --system
working-directory: ./distribution/lambda
- name: Lint and format
run: |
mypy .
black . --check
working-directory: ./distribution/lambda
- name: Retrieve and export commit date, hash, and tags
run: |
echo "QW_COMMIT_DATE=$(TZ=UTC0 git log -1 --format=%cd --date=format-local:%Y-%m-%dT%H:%M:%SZ)" >> $GITHUB_ENV
9 changes: 5 additions & 4 deletions distribution/lambda/Makefile
@@ -27,6 +27,7 @@ package:
if [ "$${QW_LAMBDA_BUILD:-0}" = "1" ]
then
pushd ../../quickwit/
rustc --version
cargo lambda build \
-p quickwit-lambda \
--release \
@@ -60,10 +61,10 @@ bootstrap:
cdk bootstrap aws://$$CDK_ACCOUNT/$$CDK_REGION

deploy-hdfs: package check-env
cdk deploy -a cdk/app.py HdfsStack
cdk deploy --require-approval never -a cdk/app.py HdfsStack

deploy-mock-data: package check-env
cdk deploy -a cdk/app.py MockDataStack
cdk deploy --require-approval never -a cdk/app.py MockDataStack

print-mock-data-metastore: check-env
python -c 'from cdk import cli; cli.print_mock_data_metastore()'
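
The `--require-approval never` flag added to the deploy targets tells the CDK CLI never to ask for approval of security-sensitive changes, which is what allows unattended deploys. As a rough illustration, the same invocation could be assembled from Python. `cdk_deploy_args` is a hypothetical helper introduced for this sketch and is not part of the repository.

```python
def cdk_deploy_args(stack: str, app: str = "cdk/app.py") -> list[str]:
    """Build the argument list used by the deploy-* Makefile targets.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    return [
        "cdk",
        "deploy",
        # Never prompt for approval, as in the updated Makefile.
        "--require-approval",
        "never",
        # Application entry point, as passed with -a in the Makefile.
        "-a",
        app,
        stack,
    ]


# Both example stacks from the Makefile.
hdfs_args = cdk_deploy_args("HdfsStack")
mock_data_args = cdk_deploy_args("MockDataStack")
print(" ".join(hdfs_args))
```

The list could be handed to `subprocess.run(args, check=True)` once the CDK CLI is installed and the variables verified by `check-env` are set.
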
Expand All @@ -76,11 +77,11 @@ before-destroy:

destroy-hdfs: before-destroy check-env
python -c 'from cdk import cli; cli.empty_hdfs_bucket()'
cdk destroy --force -a cdk/app.py HdfsStack
cdk destroy --force -a cdk/app.py HdfsStack

destroy-mock-data: before-destroy check-env
python -c 'from cdk import cli; cli.empty_mock_data_buckets()'
cdk destroy --force -a cdk/app.py MockDataStack
cdk destroy --force -a cdk/app.py MockDataStack

clean:
rm -rf cdk.out
23 changes: 23 additions & 0 deletions distribution/lambda/Pipfile
@@ -0,0 +1,23 @@
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
cdk = {file = "cdk", editable = true}
aws-cdk-lib = "2.95.1"
cargo-lambda = "1.1.0"
constructs = "10.3.0"
pyyaml = "6.0.1"
black = "24.3.0"
boto3 = "1.28.59"
mypy = "1.7.0"
ziglang = "0.11.0"

# types
boto3-stubs = "1.28.59"
types-requests = "2.31.0.2"
types-pyyaml = "6.0.12.11"

[requires]
python_version = "3.10"
441 changes: 441 additions & 0 deletions distribution/lambda/Pipfile.lock

Large diffs are not rendered by default.

67 changes: 39 additions & 28 deletions distribution/lambda/README.md
@@ -31,36 +31,13 @@ console](https://console.aws.amazon.com/servicequotas/home/services/lambda/quota
### Python venv

This project is set up like a standard Python project. The initialization
process also creates a virtualenv within this project, stored under the `.venv`
directory. To create the virtualenv it assumes that there is a `python3`
executable in your path with access to the `venv` package. If for any reason the
automatic creation of the virtualenv fails, you can create the virtualenv
manually.
The Python environment is configured using pipenv:

To manually create a virtualenv on MacOS and Linux:

```bash
python3 -m venv .venv
```

After the init process completes and the virtualenv is created, you can use the following
step to activate your virtualenv.

```bash
source .venv/bin/activate
```

Once the virtualenv is activated, you can install the required dependencies.

```bash
pip install .
```

If you prefer using Poetry, achieve the same by running:
```bash
poetry shell
poetry install
# Install pipenv if needed.
pip install --user pipenv
pipenv shell
pipenv install
```

### Example stacks
@@ -99,6 +76,40 @@ make deploy-mock-data
make invoke-mock-data-searcher
```

### Configurations

The following environment variables can be configured on the Lambda functions.
Note that only a small subset of all Quickwit configurations are exposed to
simplify the setup and avoid unstable deployments.

| Variable | Description | Default |
|---|---|---|
| QW_LAMBDA_INDEX_ID | the index this Lambda interacts with (one and only one) | required |
| QW_LAMBDA_METASTORE_BUCKET | bucket name for metastore files | required |
| QW_LAMBDA_INDEX_BUCKET | bucket name for split files | required |
| QW_LAMBDA_OPENTELEMETRY_URL | HTTP OTEL tracing collector endpoint | none, OTEL disabled |
| QW_LAMBDA_OPENTELEMETRY_AUTHORIZATION | Authorization header value for HTTP OTEL calls | none, OTEL disabled |
| QW_LAMBDA_ENABLE_VERBOSE_JSON_LOGS | true to enable JSON logging of spans and logs in CloudWatch | false |
| RUST_LOG | [Rust logging config][1] | info |

[1]: https://rust-lang-nursery.github.io/rust-cookbook/development_tools/debugging/config_log.html


Indexer only:
| Variable | Description | Default |
|---|---|---|
| QW_LAMBDA_INDEX_CONFIG_URI | location of the index configuration file, e.g. `s3://mybucket/index-config.yaml` | required |
| QW_LAMBDA_DISABLE_MERGE | true to disable compaction merges | false |
| QW_LAMBDA_DISABLE_JANITOR | true to disable retention enforcement and garbage collection | false |
| QW_LAMBDA_MAX_CHECKPOINTS | maximum number of ingested file names to keep in source history | 100 |

Searcher only:
| Variable | Description | Default |
|---|---|---|
| QW_LAMBDA_SEARCHER_METASTORE_POLLING_INTERVAL_SECONDS | refresh interval of the metastore | 60 |
| QW_LAMBDA_PARTIAL_REQUEST_CACHE_CAPACITY | `searcher.partial_request_cache_capacity` node config | 64M |
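
As a rough illustration, the optional searcher settings listed above could be collected into a dictionary before being handed to the Lambda function. The variable names come from the tables. The helper `searcher_overrides` is hypothetical, and the only assumption it encodes is that Lambda environment values are passed as strings.

```python
def searcher_overrides(
    polling_interval_seconds: int = 60,
    partial_request_cache_capacity: str = "64M",
) -> dict[str, str]:
    """Return searcher-only variables as strings (hypothetical helper).

    The defaults mirror the ones listed in the searcher table.
    """
    return {
        "QW_LAMBDA_SEARCHER_METASTORE_POLLING_INTERVAL_SECONDS": str(
            polling_interval_seconds
        ),
        "QW_LAMBDA_PARTIAL_REQUEST_CACHE_CAPACITY": partial_request_cache_capacity,
    }


# Refresh the metastore every 30 seconds instead of the default 60.
env = searcher_overrides(polling_interval_seconds=30)
print(env)
```

In the CDK stacks of this repository, such a dictionary would presumably be passed through the `searcher_environment` parameter of `QuickwitService`.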


### Set up a search API

You can configure an HTTP API endpoint around the Quickwit Searcher Lambda. The
7 changes: 4 additions & 3 deletions distribution/lambda/cdk/app.py
@@ -4,9 +4,9 @@

import aws_cdk as cdk

from cdk.stacks.services.quickwit_service import DEFAULT_LAMBDA_MEMORY_SIZE
from cdk.stacks.examples.hdfs_stack import HdfsStack
from cdk.stacks.examples.mock_data_stack import MockDataStack
from stacks.services.quickwit_service import DEFAULT_LAMBDA_MEMORY_SIZE
from stacks.examples.hdfs_stack import HdfsStack
from stacks.examples.mock_data_stack import MockDataStack

HDFS_STACK_NAME = "HdfsStack"
MOCK_DATA_STACK_NAME = "MockDataStack"
@@ -50,6 +50,7 @@ def package_location_from_env(type: Literal["searcher"] | Literal["indexer"]) ->
indexer_package_location=package_location_from_env("indexer"),
searcher_package_location=package_location_from_env("searcher"),
search_api_key=os.getenv("SEARCHER_API_KEY", None),
data_generation_interval_sec=int(os.getenv("DATA_GENERATION_INTERVAL_SEC", 300)),
)

app.synth()
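
The new `data_generation_interval_sec` value is read from `DATA_GENERATION_INTERVAL_SEC` and later turned into a schedule rate. To my understanding, EventBridge rate expressions are written in whole minutes, hours, or days, so a value such as 90 seconds would probably be rejected. Below is a minimal sketch of a guard under that assumption. `read_interval_sec` is a hypothetical helper, not code from this commit.

```python
from collections.abc import Mapping


def read_interval_sec(environ: Mapping[str, str], default: int = 300) -> int:
    """Read DATA_GENERATION_INTERVAL_SEC, falling back to `default`.

    Assumption: the interval must be a positive whole number of minutes.
    """
    raw = environ.get("DATA_GENERATION_INTERVAL_SEC")
    value = default if raw is None else int(raw)
    if value <= 0 or value % 60 != 0:
        raise ValueError(
            f"DATA_GENERATION_INTERVAL_SEC must be a positive multiple of 60, got {value}"
        )
    return value


# With the variable unset, the default of 300 seconds (5 minutes) applies.
print(read_interval_sec({}))
# An explicit value of two minutes; a CDK app would pass os.environ instead.
print(read_interval_sec({"DATA_GENERATION_INTERVAL_SEC": "120"}))
```

Taking the environment as a mapping keeps the helper easy to test without touching the real process environment.
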
4 changes: 2 additions & 2 deletions distribution/lambda/cdk/cli.py
@@ -19,8 +19,8 @@
import boto3
import botocore.config
import botocore.exceptions
from cdk import app
from cdk.stacks.examples import hdfs_stack, mock_data_stack
from . import app
from stacks.examples import hdfs_stack, mock_data_stack

region = os.environ["CDK_REGION"]

7 changes: 7 additions & 0 deletions distribution/lambda/cdk/setup.py
@@ -0,0 +1,7 @@
from setuptools import setup, find_packages

setup(
name="cdk",
version="0.1.0",
packages=find_packages(),
)
1 change: 1 addition & 0 deletions distribution/lambda/cdk/stacks/examples/hdfs_stack.py
@@ -51,6 +51,7 @@ def __init__(
searcher_memory_size=searcher_memory_size,
indexer_package_location=indexer_package_location,
searcher_package_location=searcher_package_location,
indexer_timeout=aws_cdk.Duration.minutes(10),
)

aws_cdk.CfnOutput(
14 changes: 12 additions & 2 deletions distribution/lambda/cdk/stacks/examples/mock_data_stack.py
@@ -30,6 +30,7 @@ def __init__(
construct_id: str,
index_id: str,
qw_svc: quickwit_service.QuickwitService,
data_generation_interval_sec: int,
**kwargs,
):
super().__init__(scope, construct_id, **kwargs)
@@ -59,7 +60,9 @@ def __init__(
rule = aws_events.Rule(
self,
"ScheduledRule",
schedule=aws_events.Schedule.rate(aws_cdk.Duration.minutes(5)),
schedule=aws_events.Schedule.rate(
aws_cdk.Duration.seconds(data_generation_interval_sec)
),
)
rule.add_target(aws_events_targets.LambdaFunction(generator_lambda))

@@ -139,6 +142,7 @@ def __init__(
indexer_package_location: str,
searcher_package_location: str,
search_api_key: str | None = None,
data_generation_interval_sec: int = 300,
**kwargs,
) -> None:
"""If `search_api_key` is not set, the search API is not deployed."""
@@ -167,7 +171,13 @@ def __init__(
searcher_package_location=searcher_package_location,
)

Source(self, "Source", index_id=index_id, qw_svc=qw_svc)
Source(
self,
"Source",
index_id=index_id,
qw_svc=qw_svc,
data_generation_interval_sec=data_generation_interval_sec,
)

if search_api_key is not None:
SearchAPI(
4 changes: 2 additions & 2 deletions distribution/lambda/cdk/stacks/services/indexer_service.py
@@ -13,6 +13,7 @@ def __init__(
index_config_bucket: str,
index_config_key: str,
memory_size: int,
timeout: aws_cdk.Duration,
environment: dict[str, str],
asset_path: str,
**kwargs,
@@ -32,8 +33,7 @@ def __init__(
"QW_LAMBDA_INDEX_CONFIG_URI": f"s3://{index_config_bucket}/{index_config_key}",
**environment,
},
# use a strict timeout and retry policy to avoid unexpected costs
timeout=aws_cdk.Duration.minutes(1),
timeout=timeout,
retry_attempts=0,
reserved_concurrent_executions=1,
memory_size=memory_size,
3 changes: 3 additions & 0 deletions distribution/lambda/cdk/stacks/services/quickwit_service.py
@@ -32,6 +32,8 @@ def __init__(
indexer_package_location: str,
indexer_memory_size: int = DEFAULT_LAMBDA_MEMORY_SIZE,
indexer_environment: dict[str, str] = {},
# small default timeout to avoid unexpected costs and hanging indexers
indexer_timeout: aws_cdk.Duration = aws_cdk.Duration.minutes(1),
searcher_memory_size: int = DEFAULT_LAMBDA_MEMORY_SIZE,
searcher_environment: dict[str, str] = {},
**kwargs,
@@ -55,6 +57,7 @@ def __init__(
index_config_bucket=index_config_bucket,
index_config_key=index_config_key,
memory_size=indexer_memory_size,
timeout=indexer_timeout,
environment=indexer_environment,
asset_path=indexer_package_location,
)
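
The change above turns the indexer timeout into a parameter with a deliberately small default, which individual stacks can raise (the HDFS example uses 10 minutes). The pattern can be sketched with the standard library alone. `IndexerSettings` is a hypothetical stand-in for the CDK construct arguments, and the 15-minute ceiling reflects the maximum timeout AWS Lambda allows.

```python
from dataclasses import dataclass
from datetime import timedelta

# AWS Lambda does not allow function timeouts above 15 minutes.
LAMBDA_MAX_TIMEOUT = timedelta(minutes=15)


@dataclass(frozen=True)
class IndexerSettings:
    """Hypothetical stand-in for the indexer arguments of the CDK construct."""

    # Small default to avoid unexpected costs and hanging indexers.
    timeout: timedelta = timedelta(minutes=1)

    def __post_init__(self) -> None:
        # Reject zero, negative, and over-limit timeouts early.
        if not timedelta(0) < self.timeout <= LAMBDA_MAX_TIMEOUT:
            raise ValueError(f"timeout out of range: {self.timeout}")


# The default applies unless a stack asks for more, as the HDFS example does.
default_settings = IndexerSettings()
hdfs_settings = IndexerSettings(timeout=timedelta(minutes=10))
print(default_settings.timeout, hdfs_settings.timeout)
```

Keeping the default small means a stack has to opt in explicitly to longer, and potentially more expensive, indexing runs.
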
