DwarfCompilationUnit.ReadData causes excessive memory usage on large binaries #46

Open
dedmen opened this issue Oct 14, 2021 · 1 comment

dedmen commented Oct 14, 2021

None of the attributes parsed in DwarfSymbolProvider.DwarfCompilationUnit.ReadData are deduplicated/interned.
For a big binary (in my case with about 900MB of debug info) this causes extreme memory usage.
Within the first 100 compilation units my memory usage rises to 12GB, and then it gets stuck there because I run out of memory.

As an ultra-ugly hotfix I added this in DwarfSymbolProvider.ParseCompilationUnits:

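public class StringInterner
{
    // Deduplicates equal objects so that only one instance of each stays alive.
    // meh https://github.com/dotnet/runtime/issues/21603 https://stackoverflow.com/questions/7760364/how-to-retrieve-actual-item-from-hashsett
    // Note: this only deduplicates anything if the stored type compares by value (Equals/GetHashCode).
    private readonly ConcurrentDictionary<object, object> stringBank = new ConcurrentDictionary<object, object>();

    public object InternObject(object str)
    {
        if (str == null) return str;

        // GetOrAdd returns the instance already in the bank, or adds this one and returns it.
        // Unlike TryGetValue followed by AddOrUpdate, two racing threads end up with the same instance.
        return stringBank.GetOrAdd(str, str);
    }
}
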
private static DwarfCompilationUnit[] ParseCompilationUnits(byte[] debugData, byte[] debugDataDescription, byte[] debugStrings, NormalizeAddressDelegate addressNormalizer)
{
    using (DwarfMemoryReader debugDataReader = new DwarfMemoryReader(debugData))
    using (DwarfMemoryReader debugDataDescriptionReader = new DwarfMemoryReader(debugDataDescription))
    using (DwarfMemoryReader debugStringsReader = new DwarfMemoryReader(debugStrings))
    {
        List<DwarfCompilationUnit> compilationUnits = new List<DwarfCompilationUnit>();

        // One interner shared by all compilation units, so equal values are deduplicated across CUs
        StringInterner interner = new StringInterner();

        List<Task> tasksList = new List<Task>();

        while (!debugDataReader.IsEnd)
        {
            // Reading stays on this thread; it is already the bottleneck
            DwarfCompilationUnit compilationUnit = new DwarfCompilationUnit(debugDataReader, debugDataDescriptionReader, debugStringsReader, addressNormalizer, interner);

            tasksList.Add(Task.Run(() =>
            {
                // intern all attributes in separate threads
                foreach (var compilationUnitSymbol in compilationUnit.Symbols)
                {
                    compilationUnitSymbol.Attributes =
                        compilationUnitSymbol.Attributes
                            .Select(x => new KeyValuePair<DwarfAttribute, DwarfAttributeValue>(x.Key, interner.InternObject(x.Value) as DwarfAttributeValue))
                            .ToDictionary(x => x.Key, x => x.Value);
                }
            }));

            compilationUnits.Add(compilationUnit);
        }

        // Wait until every interning task has finished before handing the units out
        Task.WaitAll(tasksList.ToArray());

        return compilationUnits.ToArray();
    }
}

This keeps my memory usage at the 400th compilation unit down at 7.7GB, which is at least usable.
I originally did the interning in DwarfCompilationUnit.data, but that took too much time. The data reading is already the performance bottleneck, so it is better not to add anything extra to it.
Moving it out into a separate thread/task works well for me so far.
One could probably intern the whole attribute instead of just the attribute value. I'm not sure whether that would be better; I assume it won't.
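
For reference, interning like this only saves memory if the interned type compares by value; whether DwarfAttributeValue overrides Equals/GetHashCode is not shown in this issue. Below is a minimal sketch of that requirement. `AttrValue` and `Interner<T>` are hypothetical stand-ins made up for illustration, not types from the library.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for an attribute value; NOT the library's DwarfAttributeValue.
public sealed class AttrValue : IEquatable<AttrValue>
{
    public int Type { get; }
    public object Value { get; }

    public AttrValue(int type, object value)
    {
        Type = type;
        Value = value;
    }

    // Value equality: two instances with the same content count as the same key.
    public bool Equals(AttrValue other) =>
        other != null && Type == other.Type && object.Equals(Value, other.Value);

    public override bool Equals(object obj) => Equals(obj as AttrValue);

    public override int GetHashCode() =>
        (Type * 397) ^ (Value == null ? 0 : Value.GetHashCode());
}

// Hypothetical generic interner: hands back one canonical instance per distinct value.
public sealed class Interner<T> where T : class
{
    private readonly ConcurrentDictionary<T, T> bank = new ConcurrentDictionary<T, T>();

    // GetOrAdd returns the instance already stored, or stores and returns this one.
    public T Intern(T item) => item == null ? null : bank.GetOrAdd(item, item);

    public int Count => bank.Count;
}
```

With this, interning two separately constructed `new AttrValue(1, "std")` instances returns the same reference both times. Without the Equals/GetHashCode overrides the dictionary falls back to reference equality and nothing gets deduplicated.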

dedmen commented Oct 14, 2021

The next, far more elaborate step would be to notice that almost all CUs contain the std:: namespace, with the same symbols inside it over and over again. These could all be deduplicated (the only things that differ are the offsets), but oof.
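
A rough sketch of what that could look like, assuming a simplified symbol tree. `Sym` and `SubtreeDedup` are hypothetical names invented here and do not match the library's symbol type. The idea is to build a structural key that leaves the offset out, and to reuse the first subtree seen for each key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified symbol; NOT the library's symbol type.
public sealed class Sym
{
    public string Tag;
    public string Name;
    public int Offset;                          // differs between compilation units
    public List<Sym> Children = new List<Sym>();
}

public static class SubtreeDedup
{
    // Structural key built from tag, name and children; Offset is deliberately ignored.
    public static string Key(Sym s) =>
        s.Tag + "|" + s.Name + "(" + string.Join(",", s.Children.Select(Key)) + ")";

    // Returns the first structurally identical subtree seen so far, or registers this one.
    public static Sym Canonical(Sym s, Dictionary<string, Sym> seen)
    {
        string key = Key(s);
        if (seen.TryGetValue(key, out Sym existing))
        {
            return existing;
        }
        seen[key] = s;
        return s;
    }
}
```

This is a simplification: string keys cost memory themselves (a hash of the subtree would be the obvious replacement), names containing the separator characters could collide, and the per-CU offsets would have to be stored somewhere outside the shared subtree.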
