Friday 11 September 2015

Live debugging with WinDBG and VMMap. Not managed, not unmanaged mystery

And so it happened. Because of the nature of the issue one of our customers has recently experienced I needed to figure out how to debug the issue in the live Windbg session. And it appeared to be quite easy and handy for the type of situation that we had.

So, the situation was like next: something was eating memory on the customer solution. When I got the memory dump and checked the memory sizes, this is what I've got:

Of 10Gb memory dump:

2Gb was managed memory
0.1Gb was unmanaged heap

and that's it. Windbg has shown only about 2Gb of managed and unmanaged memory alltogether, but what about the 8Gb? I felt so helpless, I've never seen anything like that. This mystery catched me so much that I was going to a party still thinking about what it can be.

Maybe a little drinking at a party helped, but after some tries and a couple of days the mystery was solved.

So, if you have a weird memory distribution, that's what you should do:

1. Download the beautiful VMMap

2. Attach to a w3wp process and look at the memory:


And right here I felt like I've found a way to solve the mystery. Look at Commited values. Managed heap is GC heap in memory dumps, things that you can debug nicely with !dumpheap and other handy commands. Heap is what is shown as unmanaged heap in memory dumps, which can be debugged not so nicely, but still possible.
And Private Data that takes more than a half of the memory for this process. Where is that even comes from?

3. Select the problematic area (in my case, Private Data) and observe the allocations. In my case it was obvious that there is a huge amount of allocations of 7220kb, 14440kb and another "pair" 7222 and 14444. Second value is always 2x the first.

It is related to the way how the memory is being allocated in Private Data. First it allocates the smaller amount and if it is not enough it allocates 2x more. I'm not really a queen of the details here, but this is how approximately it works.

Unsolved question for me is also why this memory is not shown up in Heap in the memory dump. It is probably a silly question for memory-gurus, but I couldn't find an answer.

4. Then it became obvious that I need to find out what exactly makes these allocations. I've read that memory is being allocated in Private Data when native VirtualAlloc() is being called. So the idea is to find those calls of VirtualAlloc that are allocating around 7220 and 14440 Kb of data.

NOTE: the values shown in VMMap are approximate! It gave me a bit of nightmares trying to figure out why I can't catch the exact memory allocations, and the answer was easy - they are just not really accurate.

5. After finding what makes the allocations (in my case it was VirtualAlloc), we need to catch the process of allocation in real time. This is how you can do it:

- Open Windbg, go to File – Attach to a process, select your w3wp process there.
- If process stops on some exception, write “g” to let it continue.
- Press Ctrl+Break or Ctrl+Pause to interfere
- Then, enter in Windbg window:

bp kernel32!virtualalloc ".if ((@edx>=0x70C000)&(@edx<=0x70F000))|((@edx>=0xE10000)&(@edx<=0xE1C000)) { .printf \"Allocating %lu bytes, which is 7,224k and is a sign of a memory leak\n\", @edx; kb 4 } .else {g}"
It will create a breakpoint on a native method that allocates around 7,224 (14,444 when doubled) kilobytes.

A bit of explanation:

Here we use kernel32!virtualalloc, because in my case that was the "guilty" method. But it can be anything else.

@edx in this case is a register, that contains the "amount to allocate" parameter passed to VirtualAlloc function.

0x70C000...0x70F000 is hexadecimal values for 7389184...7401472 bytes, which is 7216...7228 kilobytes. I tried to catch all amounts that are similar to 7224.

0xE10000...0xE1C000 are the same values, doubled. I used them too, because I saw such allocations in VMMap.

.else {g} - means that if this is just a usual VirtualAlloc with other amounts  of memory, then just continue execution.


- After you added the breakpoint, type “g” to continue.

- Wait until the process stops at the breakpoint. Note the “Allocating…” message – it is what we set in our breakpoint to print out.

- Type .loadby sos clr and press Enter. This will allow us to see the managed code.
- Type !clrstack.

Ideally, after all these actions we’ll see the stacktrace that lead to the bad memory allocation:



(my pictures are just examples, taken with another number of bytes from a running Sitecore instance)

- To see the objects of the stactrace, type !dso. It should give all the list of objects and printing out strings that you have there. This can also help to figure out where this code comes from.


Not ideally, we’ll get only some native stacktrace there. But anyway, then we can at least google for specific native calls and try to figure out where they came from.

- To delete all the breakpoints so that process continues, go to Edit – Breakpoints – Remove all.


The post lacks some pictures and details for now, so it's a post's beta-release ("cause it's beta than nothing").


Tuesday 4 September 2012

Meet Git


Last days I was struggling with Git and GitHub, as after work I sometimes make some custom projects and try to learn something new. So I'm going to keep in this post my own piece of advice on using git, as I've already stepped in one and the same situation several times :)


Permission denied (publickey):


If you try to clone a repository from the server to your Mac, and it says "Permission denied (publickey)", all you need to do is:
1. Show hidden files on Mac:

defaults write com.apple.finder AppleShowAllFiles TRUE

killall Finder

2. Check what you have in /Users/youruser/.ssh folder. In my case there was a github_rsa file.
3. Run
ssh-add ~/.ssh/github_rsa
It will say:
Enter passphrase for /Users/sitecore/.ssh/github_rsa:
But when you typing something, nothing appears. Don't worry - it's ok, just type your passphrase (password) and press Enter.

Voila! Identity added.

Clone only one branch:


git init
git remote add -t $BRANCH -f origin $REMOTE_REPO
git checkout $BRANCH

Standard git clone -b branchname - will clone full repository and just checkout you to needed branch. The same as if you did git clone (repository) and git checkout (branch)

Wednesday 18 July 2012

Investigating issues with locks

Tadammm

Last post from my "Windbg bible" series :) Let's finish it and then I'll go for something new and more complicated.
What does lock means? In simple words. I call "Lock" a situation, when several threads are requesting one and the same resource (object, file) when this resouce is already locked by some other thread. While this thread is holding a lock on a resource, while it is working on it - other threads are not able to touch it and just spend time in waiting mode.

What are the sympthoms of having lock-issue?

Monday 16 July 2012

Investigating High Processor Usage issues

If you only know how much time it took for me to make this post! First of all, it was hard to start it, as sometimes I am a bit bored of writing about things that I have done many times already. And in addition, first version of the post was washed away by incorrectly closed browser. Oh no!.. Let's start again...

Today we are talking about heavy CPU load. How it all begins?  Usually customers are complaining about the fact that his solution suddenly started to eat 100% CPU. Actually, it can be not only 100, but any value that is higher than standard CPU consumption for this process. Requests can be performed slowly or are not served at all.
How to gather dumps in these situations, is described here. But often it will be even better to gather dumps simply by clicking context menu on w3wp process in TaskManager.

Wednesday 11 July 2012

Investigating issues with Unmanaged Memory. First steps.


I want to write this blogpost really fast, until I forget everything, despite I have absolutely no time and need to work on other cases :( Let's count this as part of my work, okay?

Prologue.

You can scroll it down to scene 1, if you want to concentrate on concrete steps.

This case had a long story - starting from OOM troubles on 32-bit process, when process 32-bit mode was considered as main boundary for a large solution. BTW, my troubles with opening dumps at that point resulted in this blog post, if you remember it.
After switching to 64bit solution was running smoothly for some time, we even closed the support case. Suddenly (oh, I love this word!) our poor process had eaten 66Gb of memory, which is unbelievably huge. Actually, it was all of the memory available - 16Gb of RAM and 50Gb of page file.


Customer managed to gather a dump when process memory usage was at something like 6Gb. I was almost sure, that issue solution is struggling because of was this bad guy - Lucene analyzer. Luckily, we have script that counts all related to this issue data structures (and also counts cache sizes), so I thought that I will just run this script on the dump and confirm my assumptions.


I was a bit shocked when script has shown only 140Mb of Lucene-related things in memory. Cache sizes were in healthy range. So this is something new. I loaded our good old friend WinDBG and started analyzing memory as usually.

Wednesday 4 July 2012

Investigating dumps with exception


Dump that contains concrete exception is useful in case of "Sometimes"-issues. "Sometimes"-issue is an issue, which description is started from one of the words - sometimes, rarely, suddenly etc. Do you recall some of such cases, and how often they are closed as not-reproducible? Memory dump is your friend here :)

In case of any exception I would suggest to gather dump the same as in case for OOM exception, which is described here. Do not forget to load needed modules and symbols, and after that you are ready to start investigation.


Actually, dump with concrete exception is simplest situation.

Tuesday 3 July 2012

Pocket-size vocabulary


In this post I'm going to keep some useful WinDBG commands and tips. Simply and short :) I'll update it from time to time.


  • .time 

           Shows date and time when dump was created:




  • Opening a new dump. If you need to open another dump in same windbg window, choose Debug -> Stop debugging, or press Shift+F5.
  • Question mark makes conversion from hexadecimal to decimal and vice versa:
0:082> ?1e74
Evaluate expression: 7796 = 00001e74
0:082> ?0n7796
Evaluate expression: 7796 = 00001e74

0n is prefix for decimal numbers
0x is prefix for hexadecimal numbers
  • There is an autocompletion option in WinDBG: you may start writing !ana, then press Tab and it will be completed to !analyze
  • .chain is a command to look currently loaded extensions:
  • .unload is a command used to unload one of the currently loaded extensions. For example, to change it with another version of this extension. Make sure that you pass full name of the extension as a parameter - exactly in a form specified in .chain output:
  • !sam c:\temp This command will export all dlls from current dump to the location you've specified as a parameter. Note, that this is a command from psscor dlls, so if you have loaded sos instead of psscor, it won't work.


  • .logopen /t c:\temp\Output.txt - if you want to save your next commands output to a file.