Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG][STIR_SHAKEN] stir_shaken_verify random issue load_cert/PEM_X509_INFO_read_bio #3446

Open
Integration-IT opened this issue Aug 22, 2024 · 5 comments
Labels

Comments

@Integration-IT
Copy link
Contributor

Version:

version: opensips 3.4.6 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 0f0d08dba

Issue:
In previous tests, stir_shaken_verify randomly did returned an -1 ret code with an internal error during the checks.
the PR aspire to return more explicite information to hook a path of investigation.

After inspection we have new informations with an explicit error reason.

ERROR:stir_shaken:load_cert: error reading certificate stack
ERROR:stir_shaken:w_stir_verify: Failed to load certificate
ERROR:SCRIPT:DBG: [REQUEST ROUTE][CHECK-STIR-SHAKEN] | -1: Internal error | err_code: 500 | err_reason: Failed to load certificate.

Starting point :

  • w_stir_verify
  • load_cert

Entry point error:

  • error reading certificate stack

Other info:

  • It is not possible to replicate the test with unit calls.
  • The issue occurs randomly depending on traffic flow and the burst of checks.
  • ca_list param is defined to a bundle file containing CAroot, Chain1,Chain2, Issuer.
@Integration-IT
Copy link
Contributor Author

update.

focus on load_cert function, the 'error reading certificate stack' point here:

		sk = PEM_X509_INFO_read_bio(cbio, NULL, NULL, NULL);
		if (!sk) {
			LM_ERR("error reading certificate stack\n");
			X509_free(*cert);
			*cert = NULL;
			BIO_free(cbio);
			sk_X509_free(stack);
			return -1;
		}

To change the sk init, only cbio is relevant. Next step is to very cbio value before call PEM_X509_INFO_read_bio.

void log_cert(const str *cert_buf)
{
    static str cbuf;
    char *p, *end, *w;
    unsigned h;

    if (!cert_buf) {
        LM_ERR("NULL cert_buf pointer!!\n");
        return;
    }

    if (ZSTR(*cert_buf)) {
        LM_ERR("zero-string in cert_buf!! (%p %d)\n", cert_buf->s, cert_buf->len);
        return;
    }

    if (pkg_str_extend(&cbuf, 2 * cert_buf->len + 1) != 0) {
        LM_ERR("oom PKG\n");
        return;
    }

    w = cbuf.s;
    for (p = cert_buf->s, end = p + cert_buf->len; p < end; p++) {
        sprintf(w, "%02x", *(unsigned char *)p);
        w += 2;
    }
    *w = '\0';

    h = core_hash(cert_buf, NULL, 0);
    //LM_ERR("cert_buf: %u (%s)\n", h, cbuf.s);
    LM_ERR("cert_buf hash: %u\n", h);
}

		log_cert(cert_buf);
		sk = PEM_X509_INFO_read_bio(cbio, NULL, NULL, NULL);
		if (!sk) {
			LM_ERR("error reading certificate stack\n");
			X509_free(*cert);
			*cert = NULL;
			BIO_free(cbio);
			sk_X509_free(stack);
			return -1;
		}

Unfortunately, the "error reading certificate stack" continue where cbio hashes have been identified similar.
The hexadecimal encoded binary cert_buf displayed and returned by the log_cert function was the same for stir_verify success and stir_verify error.
At this point cbio has same integrity with or without certificat stack issue.

@Integration-IT
Copy link
Contributor Author

update.

To avoid race condition around sk and PEM_X509_INFO_read_bio. The idea is to clean the openssl error structure and force sk to NULL value before the init. Then after PEM_X509_INFO_read_bio try to check if we can find any error in the stack.

Error handler:

static int openssl_get_errstack(void)
{ 
    int code;
    code = ERR_get_error();
    return(code);
}

Load cert:

		sk = NULL;
		ERR_clear_error();
		sk = PEM_X509_INFO_read_bio(cbio, NULL, NULL, NULL);

		// any error after PEM_X509_INFO_read_bio ?
		err_check = openssl_get_errstack();
		if ( err_check != 0 ) {
			LM_ERR("PEM_X509_INFO_read_bio error detected: %d\n", err_check);
			LM_ERR("PEM_X509_INFO_read_bio error inspection: %s\n", ERR_error_string(err_check, 0));
		}
		
		if (!sk) {
			LM_ERR("error reading certificate stack\n");
			X509_free(*cert);
			*cert = NULL;
			BIO_free(cbio);
			sk_X509_free(stack);
			return -1;
		}

After a while the problem persists randomly, but in some cases it is possible to deal with an error code. This error completely troubles me because we previously checked the integrity of cbio. Next test is to check the cbio before and after the PEM_X509_INFO_read_bio !!!

Core log:

ERROR:stir_shaken:load_cert: PEM_X509_INFO_read_bio error detected: 151584876.
ERROR:stir_shaken:load_cert: PEM_X509_INFO_read_bio error inspection: error:0909006C:PEM routines:get_name:no start line. 

@Integration-IT
Copy link
Contributor Author

update.

Resume:

  • cbio hash still the same before and after the PEM_X509_INFO_read_bio.
  • sk could be null after the PEM_X509_INFO_read_bio for unknown reason.
  • sometimes an out of scope error is flaged in the openssl error stack after PEM_X509_INFO_read_bio even if the certificate hash is compliant. Could be threading or context issue ?
  • sometimes 0 error value error is returned and sk still not initialized.
  • non predictable case over large amount of trafic.
  • may happen rarely from 100/150 checks per second.

WIP - Let a chance to init sk with several attempts before return an error.

		looper = 0;
		sk = NULL;
		while (looper < 5 && !sk) {
			looper++;
			sk = PEM_X509_INFO_read_bio(cbio, NULL, NULL, NULL);
			if (!sk) {
				// trying to solve the random init in a non-clever way until ssl lib inspection.
				LM_ERR("bogus error reading certificate stack returned by PEM_X509_INFO_read_bio, try again\n");
			}
		}

		if (!sk) {
			LM_ERR("error reading certificate stack\n");
			X509_free(*cert);
			*cert = NULL;
			BIO_free(cbio);
			sk_X509_free(stack);
			return -1;
		}

@Integration-IT
Copy link
Contributor Author

WIP - partial FIX - no more internal error "Failed to load certificate".

10:07:36 ERROR:stir_shaken:load_cert: bogus error reading certificate stack returned by PEM_X509_INFO_read_bio, try again
10:11:58 ERROR:stir_shaken:load_cert: bogus error reading certificate stack returned by PEM_X509_INFO_read_bio, try again
10:14:03 ERROR:stir_shaken:load_cert: bogus error reading certificate stack returned by PEM_X509_INFO_read_bio, try again
10:23:55 ERROR:stir_shaken:load_cert: bogus error reading certificate stack returned by PEM_X509_INFO_read_bio, try again

vector:

  • could be openssl thread queue issue management.
  • could be random sk init under heavy load

Copy link

github-actions bot commented Sep 8, 2024

Any updates here? No progress has been made in the last 15 days, marking as stale. Will close this issue if no further updates are made in the next 30 days.

@github-actions github-actions bot added the stale label Sep 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant