Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CALCITE-5862] Incorrect semantics of ARRAY function (Spark library) when elements have Numeric and Character types #3324

Merged
merged 1 commit into from
Sep 12, 2023

Conversation

chucheng92
Copy link
Member

@chucheng92 chucheng92 commented Jul 19, 2023

What is the purpose of the change

https://issues.apache.org/jira/browse/CALCITE-5862

Brief change log

Fix incorrect spark semantics of array function when array elements have Numeric and Character types.

Verifying this change

SqlOperatorTests#testArrayFunc()

@chucheng92 chucheng92 changed the title [CALCITE-5862] Fix incorrect semantics of array function enabled in Spark Library [CALCITE-5862] Incorrect semantics of array function when elements have Numeric and Character types(enabled in Spark Library) Jul 19, 2023
@chucheng92 chucheng92 changed the title [CALCITE-5862] Incorrect semantics of array function when elements have Numeric and Character types(enabled in Spark Library) [CALCITE-5862] Incorrect spark semantics of array function when elements have Numeric and Character types Jul 19, 2023
List<RelDataType> nullTypes = new ArrayList<>();
for (int i = 0; i < opBinding.getOperandCount(); i++) {
SqlTypeFamily family = opBinding.getOperandType(i).getSqlTypeName().getFamily();
if (family == SqlTypeFamily.NUMERIC) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be rewritten as a switch case?

Copy link
Contributor

@tanclary tanclary left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of comments. Thanks for filing and opening a fix for this bug!

}
}

boolean onlyNumericCharacterTypes = opBinding.getOperandCount() > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could be misunderstanding, but if you only care about the size of the lists and not the elements themselves, you could just use boolean flags to keep track of which types have appeared. I think it will simplify the expression below.

Copy link
Contributor

@tanclary tanclary Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final List<RelDataType> types = opBinding.collectOperandTypes();
boolean hasNumeric = false;
boolean hasCharacter = false
 types.forEach(type -> {
			// null check here maybe
			switch (type.getSqlTypeName().getFamily())  {
			 case x:
			 		hasNumeric = true;
			 		break;
			 case y:
			 		hasCharacter = true;
			 		break;
			 default:
		 }
		 if hasNumeric || hasCharacter || c {....} ```
 you could do something like this maybe? (Excuse the pseudocode it's just an example)

Copy link
Member Author

@chucheng92 chucheng92 Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. You are right. I have fixed this. PTAL~

@chucheng92
Copy link
Member Author

@tanclary thanks for reviewing. I will address these comments later.

@chucheng92 chucheng92 force-pushed the array_fix branch 2 times, most recently from 65f4e2f to b93fd1c Compare July 20, 2023 08:24
Copy link
Contributor

@JiajunBernoulli JiajunBernoulli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think array must have same type.

You should use TypeCoercionImpl to solve your problem.

@chucheng92
Copy link
Member Author

chucheng92 commented Jul 23, 2023

I think array must have same type.

You should use TypeCoercionImpl to solve your problem.

Hi, Jiajun. thanks for reviewing.

However, this PR does not modify the SameOperandTypeChecker constraint of array. In fact, the current SameOperandTypeChecker allows character and numeric to exist at the same time. It's just that inferReturnType does not handle the simultaneous existence of character and numeric. please see:

  public static final SqlFunction ARRAY =
      SqlBasicFunction.create("ARRAY",
          SqlLibraryOperators::arrayReturnType,
          OperandTypes.SAME_VARIADIC);

if (!SqlTypeUtil.isComparable(types[i], types[prev])) {

The SameOperandTypeChecker will handle the special case of character, which itself allows character and numeric to coexist.

so currently run select array(1, 2, 'Hi'); can pass the OperandTypeChecker but with NPE in return type checker. This pr is used to fix the inconsistent behavior of operandTypeChecker and returnTypeInference. WDYT?

@chucheng92
Copy link
Member Author

chucheng92 commented Jul 23, 2023

btw, this PR just modify array function with Spark semantics, and the arrayReturnType is a private method just for spark semantics. This PR modify the spark arrayReturnType to solve this bug, I think it's a reasonable solution. @tanclary @JiajunBernoulli WDYT? Looking forward to your opinions.

@chucheng92 chucheng92 requested a review from tanclary July 27, 2023 12:50
@chucheng92 chucheng92 changed the title [CALCITE-5862] Incorrect spark semantics of array function when elements have Numeric and Character types [CALCITE-5862] Incorrect semantics of array function when elements have Numeric and Character types (enabled in Spark Library) Aug 1, 2023
@chucheng92 chucheng92 force-pushed the array_fix branch 2 times, most recently from 9846350 to 009e2d6 Compare August 2, 2023 02:22
@chucheng92 chucheng92 changed the title [CALCITE-5862] Incorrect semantics of array function when elements have Numeric and Character types (enabled in Spark Library) [CALCITE-5862] Incorrect semantics of ARRAY function(Spark library) when elements have Numeric and Character types Aug 2, 2023
@chucheng92
Copy link
Member Author

chucheng92 commented Aug 14, 2023

@JiajunBernoulli @tanclary Hi, Sorry to ping you. It has been a long time for this PR.

Do you have any other concerns? If you have any comments, I hope to receive your feedback. If there are some problems in my implementation, maybe you can give some specific opinions. I will appreciate it.

@tanclary
Copy link
Contributor

So to clarify: you can have an array with multiple types, but only if they are NUMERIC/CHARACTER? What if you have another combination like [1, Date '2008-12-25', 'a']?

@chucheng92
Copy link
Member Author

chucheng92 commented Aug 15, 2023

So to clarify: you can have an array with multiple types, but only if they are NUMERIC/CHARACTER? What if you have another combination like [1, Date '2008-12-25', 'a']?

@tanclary thanks for reviewing. I have updated the PR, if you help time, pls help to re-check it.

btw, the cases which have multi-types just like the case '[1, Date '2008-12-25', 'a']' can't enter in arrayReturnType method above, because it will fail in SameOperandTypeChecker. but if we call separately, we do need other types check.

@chucheng92
Copy link
Member Author

chucheng92 commented Sep 7, 2023

hi, @tanclary It’s been a long time since last review, could we continue on this PR (if you have time)? thanks.

@tanclary
Copy link
Contributor

tanclary commented Sep 7, 2023

@chucheng92 Yes I think this looks good but I will wait for @JiajunBernoulli to approve/give + 1 because they also had some comments if I remember correctly.

@JiajunBernoulli
Copy link
Contributor

@chucheng92 , Compared to NPE, the is indeed an improvement.

LGTM, remember to squash your commits before merging.

@chucheng92 chucheng92 force-pushed the array_fix branch 2 times, most recently from f28e385 to 5f632c0 Compare September 8, 2023 09:54
@chucheng92
Copy link
Member Author

@tanclary @JiajunBernoulli thanks for your patient review. all comments resolved. @tanclary If you have time, could you help to merge this?

@tanclary tanclary changed the title [CALCITE-5862] Incorrect semantics of ARRAY function(Spark library) when elements have Numeric and Character types [CALCITE-5862] Incorrect semantics of ARRAY function (Spark library) when elements have Numeric and Character types Sep 11, 2023
@tanclary
Copy link
Contributor

@chucheng92 Hi yes, would you mind editing the commit message to add a space between "function" and "(Spark library)"? Apologies for the nit. Re-ping me after and I will help merge 👍

…when elements have Numeric and Character types
@sonarcloud
Copy link

sonarcloud bot commented Sep 12, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

93.0% 93.0% Coverage
0.0% 0.0% Duplication

@chucheng92
Copy link
Member Author

@chucheng92 Hi yes, would you mind editing the commit message to add a space between "function" and "(Spark library)"? Apologies for the nit. Re-ping me after and I will help merge 👍

Thanks, fixed it.

@tanclary tanclary merged commit 2cb8167 into apache:main Sep 12, 2023
18 checks passed
@MasseGuillaume
Copy link
Contributor

This PR introduce a regression when building an array with structs.

I defined in my codebase named_struct that can return a struct type:

public class NamedStructUtils {

  private NamedStructUtils() {
    throw new UnsupportedOperationException("This is a utility class and cannot be instantiated");
  }

  public static SqlBasicFunction create() {
    return SqlBasicFunction.create(
      "NAMED_STRUCT",
      NamedStructUtils::returnType,
      OPERAND_TYPE_CHECKER);
  }

  private static String namedStructRequiresLiteralStringKeysGotExpression() {
    return "named_struct key is not a string literal, got an expression";
  }

  private static String namedStructRequiresLiteralStringKeysGotOtherType(String tpe) {
    return String.format("named_struct key is not a string literal, found type '%s'", tpe);
  }

  private static String namedStructRequiresEvenNumberOfArgs() {
    return "named_struct requires an even number for arguments";
  }
  
  private static RelDataType returnType(SqlOperatorBinding operatorBinding) {
    final List<String> keys = new ArrayList<String>();
    final List<RelDataType> values = new ArrayList<RelDataType>();
    int i = 0;
    for (RelDataType type : operatorBinding.collectOperandTypes()) {
      if (i % 2 == 0) {
        Object key = operatorBinding.getOperandLiteralValue(i, Object.class);
        if (key == null) {
          throw new CalciteException(
            namedStructRequiresLiteralStringKeysGotExpression(), null);
        }
        if (key instanceof NlsString) {
          keys.add(((NlsString) key).getValue());
        } else {
          String tpe = key.getClass().getSimpleName();
          throw new CalciteException(
            namedStructRequiresLiteralStringKeysGotOtherType(tpe), null);
        }

      } else {
        values.add(type);
      }
      i++;
    }
    final RelDataTypeFactory typeFactory = operatorBinding.getTypeFactory();
    return typeFactory.createStructType(values, keys);
  }

  private static final SqlOperandTypeChecker TWO_OR_MORE =
      OperandTypes.variadic(SqlOperandCountRanges.from(2));

  private static final SqlOperandTypeChecker OPERAND_TYPE_CHECKER =
      new SqlOperandTypeChecker() {
        @Override public boolean checkOperandTypes(
            SqlCallBinding callBinding,
            boolean throwOnFailure) {
          final boolean isRangeOk = TWO_OR_MORE.checkOperandTypes(callBinding, throwOnFailure);
          final int operandCount = callBinding.operands().size();
          final boolean isEvenNumberOfArgs = operandCount % 2 != 0;

          if (throwOnFailure && isEvenNumberOfArgs) {
            throw new CalciteException(namedStructRequiresEvenNumberOfArgs(), null);
          }

          return isRangeOk && isEvenNumberOfArgs;
        }

        @Override public SqlOperandCountRange getOperandCountRange() {
          return TWO_OR_MORE.getOperandCountRange();
        }

        @Override public String getAllowedSignatures(SqlOperator op, String opName) {
          return opName + "(...)";
        }
      };
}

Since SqlTypeName.ROW has type family null: ROW(PrecScale.NO_NO, false, Types.STRUCT, null),
it fails the check requireNonNull(family, "array element type family");

eg

array(named_struct('k', 'v'))

@chucheng92
Copy link
Member Author

@MasseGuillaume thanks for reporting this. I will try to reproduce/fix this.

@chucheng92
Copy link
Member Author

opened a ticket to track this: https://issues.apache.org/jira/browse/CALCITE-6127

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants