Soda Checks Caches results for independent checks When Running Programmatically when it shouldnt #2145

Open
jancakst opened this issue Aug 8, 2024 · 2 comments

jancakst commented Aug 8, 2024

Issue: Soda Checks Cache Issue When Running Programmatically

I need to programmatically run Soda checks over different Soda check YAML files to identify the source data type from a finite list of possible data sources. I don't want to execute all of them simultaneously because I am interested in finding which Soda check will pass. The passing YAML file will indicate which data source I am dealing with.

Example YAML Files:

  • datasource1_definition_check.yml
  • datasource2_definition_check.yml
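
For illustration, the selection logic described above can be sketched independently of Soda. The `identify_source` helper and its `scan_passes` callback are hypothetical names, not part of soda-core; `scan_passes` stands in for whatever runs a scan on one check file and reports whether it passed.

```python
from typing import Callable, Mapping, Optional, Sequence


def identify_source(
    candidates: Mapping[str, Sequence[str]],
    scan_passes: Callable[[str], bool],
) -> Optional[str]:
    """Return the first data source whose check files all pass, else None."""
    for source_name, check_files in candidates.items():
        # Each file is evaluated on its own; a failure for one candidate
        # must not influence the verdict for the next. An empty file list
        # never counts as a match.
        if check_files and all(scan_passes(f) for f in check_files):
            return source_name
    return None


# Stand-in callback for demonstration: pretend only the second file passes.
candidates = {
    "Data source 1": ["datasource1_definition_check.yml"],
    "Data source 2": ["datasource2_definition_check.yml"],
}
result = identify_source(
    candidates, lambda path: path == "datasource2_definition_check.yml"
)
print(result)  # Data source 2
```

Passing the scan runner in as a callback keeps the loop testable without a database, which makes it easier to tell a logic error in the loop apart from state leaking between scans.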

Problem Description:

When I run the YAML checks in a loop, an issue arises. For example, if I am trying to identify datasource2, the scan with datasource2_definition_check.yml should pass. However, when I first execute a scan with datasource1_definition_check.yml (which fails as expected) and then follow it with datasource2_definition_check.yml, state appears to be cached between the two scans. My logs indicate that datasource2_definition_check.yml has also failed. On closer inspection, the logs for datasource2_definition_check.yml include the results of the previous checks.

Code Snippet:

from soda.scan import Scan

SOFTWARE_CHECKS = {"Data source 1": ["datasource1_definition_check.yml"], "Data source 2": ["datasource2_definition_check.yml"]}
SODA_CONFIG = "configuration.yml"

def get_soda_scan(soda_checks_dir: str, check_file_path: str) -> "Scan":
    scan = Scan()
    scan.set_data_source_name("prepdb")

    config_path = f"{soda_checks_dir}/{SODA_CONFIG}"
    scan.add_configuration_yaml_file(config_path)
    print(f"config path: {config_path}")

    scan.add_sodacl_yaml_file(check_file_path)
    print(f"soda check file path: {check_file_path}")

    return scan

def run_checks(soda_checks_dir: str) -> None:
    for software, files in SOFTWARE_CHECKS.items():
        for file in files:
            check_file_path = f"{soda_checks_dir}/{file}"
            scan = get_soda_scan(soda_checks_dir, check_file_path)

            print(f"Trying to run tests for {check_file_path}")
            # Enable verbose logging before executing so it applies to this scan.
            scan.set_verbose(True)

            scan.execute()

            print(scan.get_logs_text())

            print(f"Has check fails: {scan.has_check_fails()}\n{scan.get_checks_fail_text()}")

            print(f"Failed check for {software} with file:\n\n{file}")

if __name__ == "__main__":
    db_name = "some_test_db"
    soda_checks_dir = f"./soda_checks/{db_name}"

    run_checks(soda_checks_dir)

Steps to Reproduce:

  1. Prepare the YAML check files (datasource1_definition_check.yml and datasource2_definition_check.yml).
  2. Run the provided script to execute the checks sequentially.
  3. Observe that the second check seems to use a cached state from the first check.

Expected Behavior:

The second check (datasource2_definition_check.yml) should pass independently of the first check's result.

Actual Behavior:

The second check fails, indicating that it uses cached state from the first check. The first file contains 4 checks in total; the second file contains 5 schema checks (4 configured as fail and 1 as warn). After the second loop iteration I expect to see 4 passes and 1 warning, but the logs show this:

...
INFO   | Oops! 4 errors. 0 failure. 1 warning. 4 pass.
Has check fails: False

Failed check for Data source 2 with file:

This is clearly incorrect: the second execution should not include the outcome of the first. The 4 errors appear to correspond to the 4 checks from the first file.
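
To narrow this down, the outcomes could be tallied from the scan's own check objects rather than read from the log summary. The sketch below assumes each check exposes an `outcome` attribute (as `scan._checks` does in the follow-up comment); `tally_outcomes` is a hypothetical helper, and the demo objects are stand-ins, not real Soda checks.

```python
from collections import Counter
from types import SimpleNamespace


def tally_outcomes(checks):
    """Count outcomes per check object; None means the check was not evaluated."""
    counts = Counter()
    for check in checks:
        outcome = check.outcome
        if outcome is None:
            counts["not evaluated"] += 1
        else:
            # Enum members expose .name; fall back to str() for anything else.
            counts[getattr(outcome, "name", str(outcome))] += 1
    return counts


# Stand-in objects for demonstration; with Soda this would be scan._checks.
demo_checks = [SimpleNamespace(outcome="pass") for _ in range(4)]
demo_checks.append(SimpleNamespace(outcome="warn"))
print(dict(tally_outcomes(demo_checks)))  # {'pass': 4, 'warn': 1}
```

If the tally for the second scan shows only its own 5 checks, the extra errors are present in the log summary but not attached to that scan's checks.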

Additional Context:

This issue might be related to the caching mechanism within the Soda scan library. Clearing or isolating the state between scans could resolve the issue.
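
One way to isolate state, sketched below, is to run each check file in its own process through the `soda scan` CLI so that nothing in memory survives between runs. This is a workaround under assumptions, not a confirmed fix: it assumes the `soda` executable is on PATH, and the helper names `build_scan_command` and `run_scan_isolated` are hypothetical.

```python
import subprocess
from typing import List, Tuple


def build_scan_command(
    data_source: str, config_path: str, check_file_path: str
) -> List[str]:
    """Build the CLI invocation for one check file."""
    return ["soda", "scan", "-d", data_source, "-c", config_path, check_file_path]


def run_scan_isolated(
    data_source: str, config_path: str, check_file_path: str
) -> Tuple[int, str]:
    """Run one scan in a fresh process and return (exit code, captured stdout)."""
    completed = subprocess.run(
        build_scan_command(data_source, config_path, check_file_path),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code is an expected result, not an exception
    )
    return completed.returncode, completed.stdout


# Only the command is built here; calling run_scan_isolated requires Soda.
print(build_scan_command("prepdb", "configuration.yml", "datasource2_definition_check.yml"))
```

The Soda documentation describes exit code 0 as all checks passing; how warnings, failures, and errors map to the other codes should be verified against the installed version.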

@tools-soda
CLOUD-8233

@jancakst changed the title from "When iterating over multiple soda check files soda Scan remembers old executions and makes checks fail despite che" to "Soda Checks Caches results for independent checks When Running Programmatically when it shouldnt" on Aug 8, 2024

jancakst commented Aug 8, 2024

I noticed another weird behaviour.

I want to detect whether any check has failed when I run a schema validation that verifies that particular columns exist. For some checks the tables may not be present at all, and in that case the logs show the following:

[13:24:47] Metrics 'schema' were not computed for check 'schema'

Given that the table wasn't present, I would expect scan.has_check_fails() to return True, but it does not: even though those checks could not be evaluated because the table is missing, it still returns False. On closer inspection, I noticed that when the table is not present, the following loop

for check in scan._checks:
    print(check.outcome)

produces None.

Looking at the implementation of has_check_fails(), it tests only for CheckOutcome.FAIL and does not account for None:

    def has_check_fails(self) -> bool:
        for check in self._checks:
            if check.outcome == CheckOutcome.FAIL:
                return True
        return False

As a result, has_check_fails() reports that nothing failed even though the checks were never evaluated because the tables were missing. The logs clearly state that errors occurred (see below), yet this method says the opposite:

INFO   | 4 checks not evaluated.
INFO   | 4 errors.
INFO   | Oops! 4 errors. 0 failures. 0 warnings. 0 pass.

This could be fixed by modifying the method along these lines:

    def has_check_fails(self) -> bool:
        for check in self._checks:
            if check.outcome == CheckOutcome.FAIL or check.outcome is None:
                return True
        return False

It would be good to know what others think.
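
Until the library behaviour is settled, the same rule could be applied on the caller's side without patching soda-core. The sketch below is an assumption-laden illustration: `has_fails_or_unevaluated` is a hypothetical helper, and `DemoOutcome` is a stand-in enum used only so the example runs without Soda installed.

```python
from enum import Enum
from types import SimpleNamespace


class DemoOutcome(Enum):
    """Stand-in for soda's CheckOutcome, used only for this demonstration."""

    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


def has_fails_or_unevaluated(checks, fail_outcome) -> bool:
    """True if any check failed or was never evaluated (outcome is None)."""
    return any(
        check.outcome is None or check.outcome == fail_outcome for check in checks
    )


# Four checks that were never evaluated, as in the missing-table case above.
missing_table = [SimpleNamespace(outcome=None) for _ in range(4)]
healthy = [
    SimpleNamespace(outcome=DemoOutcome.PASS),
    SimpleNamespace(outcome=DemoOutcome.WARN),
]

print(has_fails_or_unevaluated(missing_table, DemoOutcome.FAIL))  # True
print(has_fails_or_unevaluated(healthy, DemoOutcome.FAIL))  # False
```

With Soda this would be called as has_fails_or_unevaluated(scan._checks, CheckOutcome.FAIL). Note that scan._checks is a private attribute and may change between releases.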
