Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA followed by process killed / return 137 #6978

Open
ericstj opened this issue Jan 30, 2024 · 2 comments
Labels
blocking-clean-ci: Blocking PR or rolling builds
bug: Something isn't working
Known Build Error: Use this to report build issues in the .NET Helix tab
untriaged: New issue has not been triaged

Comments

@ericstj
Member

ericstj commented Jan 30, 2024

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=530980&view=results
Build error leg or test failing: Microsoft.ML.TorchSharp.Tests Work Item
Pull Request #6976

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": [ "Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA", "+ export _commandExitCode=137" ],
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
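
The two-entry `ErrorMessage` array above can be read as a multi-line match. A minimal Python sketch of that idea follows. It assumes the matcher treats each entry as a substring that must appear on some log line, in the listed order, with other lines allowed in between. That behavior is an assumption for illustration, and `matches_in_order` is a hypothetical helper, not part of the known-issues tooling.

```python
def matches_in_order(log_lines, error_messages):
    """Return True if each message is a substring of some log line,
    in the given order (other lines may appear in between)."""
    remaining = iter(log_lines)
    # any() consumes the shared iterator up to the first matching line,
    # so the next message can only match on a later line.
    return all(any(msg in line for line in remaining) for msg in error_messages)


# Log excerpt and ErrorMessage entries copied from this issue.
log = [
    "Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA",
    "Killed",
    "+ export _commandExitCode=137",
]
error_message = [
    "Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA",
    "+ export _commandExitCode=137",
]

print(matches_in_order(log, error_message))
```

Neither line is unique on its own, but the pair in this order only occurs when the process dies while TestSimpleQA is running.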

System Information (please complete the following information):

  • OS & Version: Ubuntu 18.04
  • ML.NET Version: latest
  • .NET Version: .NET 6.0

Describe the bug
This test fails in CI fairly regularly: the test process is killed while TestSimpleQA is running and the work item exits with code 137. The log looks like this:

Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA
Killed
+ export _commandExitCode=137
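
For reference, exit code 137 follows the shell convention of 128 plus the signal number, so it means the process received signal 9 (SIGKILL). On Linux that often points at the kernel OOM killer, although this log alone doesn't prove it. A small Python sketch of the decoding, where `decode_exit_code` is a hypothetical helper name:

```python
import signal


def decode_exit_code(code: int) -> str:
    """Describe a shell-style exit code; values above 128 mean 'killed by signal'."""
    if code > 128:
        signum = code - 128
        try:
            # Signal names depend on the platform the script runs on.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(decode_exit_code(137))
```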

Here are a few instances:
https://helixre107v0xd1eu3ibi6ka.blob.core.windows.net/dotnet-machinelearning-refs-pull-6974-merge-f61a125156aa4af1bd/Microsoft.ML.TorchSharp.Tests/1/console.83a6fa6c.log?helixlogtype=result
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-machinelearning-refs-pull-6976-merge-0a13c2cd41724c3483/Microsoft.ML.TorchSharp.Tests/1/console.ff57f777.log?helixlogtype=result

I can't currently capture this failure in a known issue because there is no unique line logged. I've seen this failure numerous times - always when TestSimpleQA is running.

Report

| Build | Definition | Test | Pull Request |
|---|---|---|---|
| 867078 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 866895 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7282 |
| 866863 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7295 |
| 866886 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7298 |
| 865385 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7295 |
| 865969 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7262 |
| 865798 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7262 |
| 865238 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 865222 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7295 |
| 865077 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7293 |
| 864183 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7291 |
| 861417 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 860725 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 860514 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7274 |
| 860151 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 859988 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7266 |
| 858976 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7284 |
| 858626 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7283 |
| 857116 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7274 |
| 854633 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7266 |
| 854104 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 852919 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7280 |
| 850556 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 850412 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 850351 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7279 |
| 850353 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 850347 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7273 |
| 850172 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7278 |
| 848904 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7266 |
| 846819 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | |
| 846450 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7272 |
| 845376 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7267 |
| 844972 | dotnet/machinelearning | Microsoft.ML.TorchSharp.Tests.WorkItemExecution | #7272 |

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 5 | 12 | 33 |

Known issue validation

Build: 🔎
Result validation: ⚠️ Build internal information not found. This may happen if your build is too old. Please use a build that is no older than two weeks. If the problem persists, contact .NET Engineering Services Team and share this issue.
Validation performed at: 2/14/2024 10:25:46 PM UTC

@ericstj ericstj added bug Something isn't working blocking-clean-ci Blocking PR or rolling builds labels Jan 30, 2024
@ghost ghost added the untriaged New issue has not been triaged label Jan 30, 2024
@ericstj
Member Author

ericstj commented Jan 30, 2024

@michaelgsharp made a good observation offline - we're seeing memory usage go up quite a bit as the tests progress.

Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity with memory usage 2,077,020,160.00 and max memory usage 2,370,473,984.00

That's about 2 GB of memory still in use (with a peak near 2.4 GB) after the previous test completed.

@ericstj
Member Author

ericstj commented Jan 31, 2024

Wow - the memory usage of this test is very high. Here's what I see from a local passing run on Windows.

  Discovering: Microsoft.ML.TorchSharp.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  Microsoft.ML.TorchSharp.Tests (found 12 test cases)
  Starting:    Microsoft.ML.TorchSharp.Tests (parallel test collections = on [20 threads], stop on fail = off)
Starting test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNer
Finished test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNer with memory usage 751,607,808.00 and max memory usage 751,607,808.00
Starting test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNerOptions
    Microsoft.ML.TorchSharp.Tests.NerTests.TestNERLargeFileGpu [SKIP]
      Needs to be on a comp with GPU or will take a LONG time.
Finished test: Microsoft.ML.TorchSharp.Tests.NerTests.TestSimpleNerOptions with memory usage 895,778,816.00 and max memory usage 895,778,816.00
Starting test: Microsoft.ML.TorchSharp.Tests.ObjectDetectionTests.SimpleObjDetectionTest
total : 171, filtered: 0, filter ratio: 0.00%
Finished test: Microsoft.ML.TorchSharp.Tests.ObjectDetectionTests.SimpleObjDetectionTest with memory usage 1,142,628,352.00 and max memory usage 1,155,977,216.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence3Classes
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence3Classes with memory usage 1,111,171,072.00 and max memory usage 1,155,977,216.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestDoubleSentence2Classes
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestDoubleSentence2Classes with memory usage 1,352,704,000.00 and max memory usage 1,352,818,688.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence2Classes
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSingleSentence2Classes with memory usage 1,365,450,752.00 and max memory usage 1,366,872,064.00
Starting test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity with memory usage 1,362,817,024.00 and max memory usage 1,368,600,576.00
    Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarityLargeFileGpu [SKIP]
      Needs to be on a comp with GPU or will take a LONG time.
    Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestTextClassificationWithBigDataOnGpu [SKIP]
      Condition(s) not met: "EnableRunningGpuTest"
Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA
Finished test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA with memory usage 4,675,801,088.00 and max memory usage 5,540,958,208.00
    Microsoft.ML.TorchSharp.Tests.QATests.TestQALargeFileGpu [SKIP]
      Needs to be on a comp with GPU or will take a LONG time.
  Finished:    Microsoft.ML.TorchSharp.Tests

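So we may have a leak (usage still grows from test to test, from about 0.75 GB after the first test to about 1.36 GB before TestSimpleQA), but this test also uses a ton of memory on its own: about 4.7 GB when it finishes, with a peak of about 5.5 GB.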
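
To separate gradual growth from the TestSimpleQA spike, the "Finished test" lines can be parsed and compared test by test. This is a rough Python sketch, assuming the log format stays exactly as shown in this run. The names `parse_memory` and `growth` are hypothetical helpers, and the sample text is two lines copied from the log above.

```python
import re

# Matches lines like:
# Finished test: <name> with memory usage 1,234.00 and max memory usage 5,678.00
FINISHED = re.compile(
    r"Finished test: (?P<name>\S+) with memory usage (?P<mem>[\d,]+)\.\d+"
    r" and max memory usage (?P<peak>[\d,]+)\.\d+"
)


def parse_memory(log_text):
    """Return [(test_name, bytes_at_finish, peak_bytes), ...] in log order."""
    rows = []
    for m in FINISHED.finditer(log_text):
        mem = int(m["mem"].replace(",", ""))
        peak = int(m["peak"].replace(",", ""))
        rows.append((m["name"], mem, peak))
    return rows


def growth(rows):
    """Change in memory-at-finish for each test relative to the previous one."""
    return [
        (name, mem - prev_mem)
        for (_, prev_mem, _), (name, mem, _) in zip(rows, rows[1:])
    ]


sample = """\
Finished test: Microsoft.ML.TorchSharp.Tests.TextClassificationTests.TestSentenceSimilarity with memory usage 1,362,817,024.00 and max memory usage 1,368,600,576.00
Starting test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA
Finished test: Microsoft.ML.TorchSharp.Tests.QATests.TestSimpleQA with memory usage 4,675,801,088.00 and max memory usage 5,540,958,208.00
"""

rows = parse_memory(sample)
for name, delta in growth(rows):
    print(f"{name}: {delta / 1e9:+.2f} GB")
```

Running the same helpers over a full console log would show which tests add memory that is never released and which ones only spike while running.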

@ericstj ericstj added the Known Build Error Use this to report build issues in the .NET Helix tab label Feb 7, 2024