March 27, 2012

Apparatus Update

A couple of quick notes:

  • System has been behaving as intended (no black screen/sound loop crashes) since I replaced the GTX 570 video card with an 8800GT.  Hopefully, when CyberpowerPC sends back a new GTX 570, the problem will not return.  At this point, I am leaning heavily (99.999999999999%) towards a video card hardware problem as the cause of my Skyrim issues.
  • I'm using 12 Skyrim mods now (can't remember them offhand, but mostly graphics related) and it's working awesomely.  The only thing that would make it more awesome is if I could use the GTX 570 that I paid for. :-)  So, yes, anxiously waiting for the RMA process to finish.
  • I'd love to use Skyrim UI since the inventory menus are so badly designed, but I don't want to deal with the hassle of continually updating SKSE.
  • Eagerly waiting for the first Skyrim DLC.
  • Still looking for a good Bokeh mod that can be installed from the Steam Workshop.  I could use the Nexus ones, but I don't want the hassle, similar to maintaining SKSE.  Is that asking too much?



March 20, 2012

Liking CyberpowerPC

I'm really liking CyberpowerPC.  They consistently respond to my emails with quality customer service, even when the answer is not positive (e.g. they don't do advance RMAs).  That, along with great support while purchasing and quick, quality shipping, will make me buy my next system from them.  I got this response to an email inquiring about the RMA process:
As we do not repair the video cards here at Cyberpower we will be replacing the video card all together. Once we receive your video card for replacement you will be notified by email.
Nice!  This should fit right in with validating whether the problem is the video card or not.  In the meantime, I have an 8800GT 512MB video card and I'll be testing that to no end to see if it produces the same crash in Skyrim.


March 18, 2012

More On Video Card

After several more hours of testing and researching, I have a few more observations:

  • Skyrim utilizes DirectX 9 and possibly some DirectX 10.
  • Grid utilizes DirectX 9 and possibly some DirectX 10.
  • Battlefield 3 utilizes DirectX 11.
  • Batman: Arkham City utilizes DirectX 11.

So far I've only experienced problems with Skyrim and Grid (the latter only in the past two days).  The crashes are the same: video feed goes out, sound loops, system unrecoverable.  This can happen after several hours or after just a minute (or even straight from power off, boot, game load, crash).  The usual suspects are out:

  • Heat - measured over and over.  Max GPU is about 72C and max CPU is 68C, and that was under extreme testing.  Normal operation runs lower, about 35C CPU and 42C GPU.
  • Overclocking - the crashes are seen without overclocking, with overclocking and with underclocking.  It ain't the overclocking, okay?  I tested it.
  • Malware - no malware on the system.
  • System errors - NO ERRORS IN THE OS EVENT LOGS.
  • Power Supply - replaced the 750W power supply with a Silverstone 1000W power supply.

Now, the fact that no issues are seen in Battlefield 3 and Batman, only in the games NOT using DX11, led me to this article which describes how installing an older DirectX 9 game can muck up your DirectX 10 but leave DirectX 11 intact.  Weird, that sounds like my issue!

Quote from article:
Obviously, something is wrong with the DirectX 9 redistributable package. However, I don't think Microsoft can do anything about this since the DirectX 9 redistributable is everywhere and has been in use for ages. We just want to warn you about this problem so you won't need to scratch your head if your DirectX 10 game does not work. It could be because you just installed a DirectX 9 game. Try reinstalling the DirectX 10 redistributable.
Well, I tried the fix (install DirectX 10 runtime manually), but to no avail.  I still get the crashes in Skyrim and (now) Grid.  YARH (Yet Another Red Herring)?  Likely.

Frustrated and unable to get away from the observations:

* Crashes in DX9/10 games (Skyrim & Grid)
* Does not crash (yet) in DX11 games
* Just started crashing in Grid a few days ago
* No errors found using OCCT, Furmark, Heaven, etc, etc, etc
* No OS errors (this means the OS does not even catch the error, indicating it is likely NOT the video driver).
* Was able to change the crash slightly by disabling Timeout Detection and Recovery (TDR).

All of the above points to a video card that is not behaving properly. 

Frustrated, I removed the Gigabyte GTX 570 from my system to pack it up for the RMA.  I found an older Nvidia Quadro FX 4600 sitting around, so I installed that.  I tried and tried to replicate the crashes, but I could not.  Changing the video card eliminated the observed crashes.  The logic is inescapable: it's not software, it's not my setup, it's not my configuration.  The only variable that changed was the video card itself.  Observation: video card is causing the problem.

Guilty? We shall see.

It's weird that this would indicate that so many people have faulty video cards.  But then again, is it really so many?  In hindsight, no.  Sure, there are a lot of forum posts about it, but then again there are probably a lot of forum posts about... well, anything.  More than half of the posts tend to be from trolls, leaving only a few with the same actual issues.

And then I remembered how my video card got a low rating (around 68%) for the ASIC in GPU-Z.  Perhaps it's true: Skyrim does push your hardware in ways other games and stress tests do not.  Here's another interesting fact: Had I not played Skyrim, I would have had zero issues with this system.  Well, at least I think so.

This is a very good video card.  When it works.
Hopefully this is the final test.  It is important to note that prior to the video card swap, I was already up to over 10 crashes in one day.  It was getting crazy.  After the swap and after several hours in Skyrim, loading saved games that would consistently crash, I got zero crashes with the older card.  

The GTX 570 now sits in a shipping box, ready for RMA.  However long it takes to get there and for CyberpowerPC to send me a replacement will hopefully be worth it if I've found the root cause of the problems for the last four months.  Now I just hope CyberpowerPC doesn't get the card, run Furmark against it and then send it back to me.  If that happens, I may just have to get another video card altogether.

"Video card, huh?  You were losing the video feed?  And then crashing?  Must be tough to be so frigging smart!"



March 17, 2012

TDR: Helpful. Next Stop: Video Card

After disabling TDR, the crashes still happened, but more slowly.  What previously was a quick video feed shutoff followed by a buzzing (the last half second of audio looping) now became a video feed shutoff with the audio still playing, somewhat distorted, and the keyboard still responding.  I was unable to regain control though, since in about 10 seconds the keyboard became unresponsive and the audio eventually became the same buzzing.

This is at least something in the crashing that changed - and it means... something.  Previous changes were only on the frequency of the crashing, but this is the first time I was able to change the crashing method.  Another interesting change was I was able to get a single Windows Event Log entry:

Log Name: System
Source: nvlddmkm
Event ID: 14
Level: Error
Keywords: Classic
Task Category: None


Binary data:
In Words
0000: 00000000 00300002 00000000 C0AA000E
0008: 00000000 00000000 00000000 00000000
0010: 00000000 00000000


In Bytes
0000: 00 00 00 00 02 00 30 00   ......0.
0008: 00 00 00 00 0E 00 AA C0   ......ªÀ
0010: 00 00 00 00 00 00 00 00   ........
0018: 00 00 00 00 00 00 00 00   ........
0020: 00 00 00 00 00 00 00 00   ........


The description for Event ID 14 from source nvlddmkm cannot be found.
Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.


If the event originated on another computer, the display information
had to be saved with the event.


The following information was included with the event:


\Device\Video5
An uncorrectable double bit error (DBE) has been detected on GPU (03 03 03).
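
As a sanity check on the binary data above: the "In Words" and "In Bytes" views should be the same 40 bytes, with each word being a little-endian 32-bit value.  Here is a minimal Python sketch that confirms this.  The function and variable names are made up for illustration, and it only re-reads the dump from the event entry; it does not attempt to decode what the NVIDIA driver means by those values.

```python
import struct

# Hex bytes copied from the "In Bytes" view of the event above.
BYTES_VIEW = """
00 00 00 00 02 00 30 00
00 00 00 00 0E 00 AA C0
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
"""

# 32-bit values copied from the "In Words" view of the same event.
WORDS_VIEW = """
00000000 00300002 00000000 C0AA000E
00000000 00000000 00000000 00000000
00000000 00000000
"""


def bytes_to_words(byte_text):
    """Regroup a hex byte dump into little-endian 32-bit words."""
    raw = bytes.fromhex("".join(byte_text.split()))
    count = len(raw) // 4
    return list(struct.unpack("<%dI" % count, raw))


def parse_words(word_text):
    """Read the whitespace-separated hex words as integers."""
    return [int(token, 16) for token in word_text.split()]


decoded = bytes_to_words(BYTES_VIEW)
expected = parse_words(WORDS_VIEW)

print(["%08X" % word for word in decoded])
print("views match:", decoded == expected)
```

So the only non-zero values in the payload are 00300002 and C0AA000E; everything else is padding.  The human-readable part is the "double bit error" string that came with the event.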


What this means to me is that the crash is centered on the video card.  Probably bad memory chips on the video card.  I ran GPU-Z and I got 68% on the ASIC - not clear on what that means, but the general consensus on the Internet is that the higher the percentage, the higher the quality of the video card build.

Shortly after this, with TDR still disabled, I saw the same crash in Grid.

Conclusion #1: Having TDR enabled is good, because it catches most crashes, preventing the system from going down.

Conclusion #2: Having TDR enabled is bad, because it hides these crashes from us and only when a crash is not recoverable (i.e. in Skyrim) do we think there is a problem.

Obviously #2 sends us down some crazy rabbit holes which, while technically related, are really not something I can do anything about.

I've sent in an RMA request for the video card.  So we'll see what happens next...

March 16, 2012

Next Stop: TDR

Well, the crashes continue.  I've started a new Skyrim character and while it works fine 99% of the time, there's still the occasional jarring crash.  I've experimented with hardware, bought a new PSU, tried to work with Bethesda, all to no avail.

While I continued researching, trying to find commonalities between the reported crashes, I ran across something called "TDR" (Timeout Detection and Recovery), used in Windows Vista and later.

Right now, I'm going to experiment with disabling this feature.  The theory is that if the video card is busy, it cannot respond to the OS in time, and the OS will then send a restart command to the video card.  Right in the middle of Skyrim, this cannot be good.

To disable TDR in Windows Vista and Windows 7, either change or add the registry settings described in the link above.  For me, I set "TdrLevel" to "0" to disable it altogether.  I'd rather get a blue screen than what I'm getting now.
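
For reference, here is a rough sketch of that change written out as .reg file text from Python.  The key path and value name are the ones Microsoft documents for TDR (TdrLevel is a DWORD under the GraphicsDrivers key, and 0 turns timeout detection off); the function name is made up for illustration.  The script only prints the text.  Saving it as a .reg file, importing it and rebooting are separate manual steps, and they need admin rights.

```python
# Registry key where Windows Vista/7 look for the TDR settings
# (per Microsoft's TDR registry key documentation).
TDR_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def build_tdr_reg(tdr_level=0):
    """Return .reg file text that sets TdrLevel.

    0 = detection disabled, 3 = recover on timeout (the Windows default).
    """
    if not 0 <= tdr_level <= 3:
        raise ValueError("TdrLevel is documented as 0 through 3")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[%s]" % TDR_KEY,
        '"TdrLevel"=dword:%08x' % tdr_level,
        "",
    ]
    return "\r\n".join(lines)


reg_text = build_tdr_reg(0)
print(reg_text)
```

To undo it, either delete the TdrLevel value or set it back to 3.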

March 08, 2012

Bethesda Support Findings: null

Bethesda support was nice enough to exchange at least 11 emails with me, but alas, it led nowhere.  Here's their last response (emphasis mine):

The links you have provided are only anecdotal evidence at best. Whilst some of those people may be encountering the same symptoms as you are, ie a hardlock on their system, this does not mean that it has a common cause. For example some of the people in one of the threads you are providing have fixed their issue with hardware changes.

Furthermore, all of the tests you have suggested simply prove that your hardware works fine under the conditions that you are testing them under. Not all applications or tests will stress your system the same way. As an example, I had a similar issue to yours on a system I had been using with a different game. I was encountering random crashes and occasional lockups with one single game, all of the stress tests I did showed passed fine, and there were no temperature issues with the system. However as soon as the overclock on the system was removed, this game stopped crashing.

Whilst this is only anecdotal evidence, I hope that it provides some insight as to why I do not believe this to be a game/software related issue.

Now it may be that the game is simply encountering a crash, but because of the configuration of your system, the operating system is unable to handle the crash normally and so the system locks up and restarts. In this case it is very difficult for us to troubleshoot the problem, as we are unable to get any solid crash reports as your system is completely locking up.



With all that said, this game is still one of the best I've seen.  For completeness, here's my response to the above email:

I understand and appreciate your time on this.  So far the workaround for me when this does happen (and it's rarer now that I've been more frugal with binding spells and weapons/armor to the favorites) is to restart the system, load the last saved game and then make a few changes, create a new save file and then reload that new save file.
That seems to consistently get me past a crash point.  The overclock recommendation does not make a difference with my problem since it occurs with normal clock, overclock or underclock.  Believe me when I say that I've spent multiple hours troubleshooting this even before contacting support.

With that said, I'm continuing to enjoy the game (2nd character build now) and am looking forward to the DLC/expansion that's been in the news lately.  I've also begun to experiment with the mods available on Steam.  They definitely add to the replay value for this game, which despite the technical issues I've encountered, I still consider one of the best games made to date.

I'm assuming there was no progress on the save files I've sent previously?  Should I encounter another consistent, repeatable, crash, I'll try sending the save game file (along with my .ini files) in the hopes it will increase game stability.  Perhaps if a game debugging mode was made available/known, I could provide those logs to Bethesda as well.

Thanks again for your time.

March 03, 2012

Bethesda Support

I've been exchanging emails with Bethesda support regarding the Skyrim issues I've been having.  They've been really patient with me and I really appreciate their support.  The latest exchange has involved them looking at my system configs.  Their current concern is my overclocking:

From Bethesda:

Given that your system is crashing at a very low level and it is restarting this does not really point to a great amount of system stability. It should be noted that the person in the youtube video who which you passed to us is also overclocking their system, which could point to a common demoninator.

My Response:

I'm not sure what to say other than I have the same amount of problems
when I'm not overclocking and even when I'm underclocking and even
when everything is set to auto.  I do agree with your statement in
general terms, but in my case, everything I've tried thus far has NOT
pointed to anything OTHER THAN a stable system.  I have not been able
to create a crash, either system or to desktop, or errors or anything
at all OTHER THAN in Skyrim. The ironic part is that I bought this
system specifically for Skyrim and had only Steam and Skyrim installed
when I first started having problems.  I did NOTHING to my system when
I started having problems, it was literally out of the box.
Troubleshooting started when I started crashing over and over and the
game became unplayable.

Here's a quick summary of what I've done to date:

* Reinstalled Skyrim from Steam
* Validated files in Steam
* Replaced power supply with a known good brand, high rated, over
powered (1000W)
* Reinstalled drivers (Video/System/etc) using a clean install
* Ran Prime95 (to test RAM) for 7 hours
* Ran Linx (to also test RAM) for 1 hour
* Ran OCCT (to test GPU RAM)
* Ran OCCT:CPU (to test CPU)
* Ran Kombustor
* Ran Heaven
* Ran Linx, OCCT and Kombustor AT THE SAME time
* Monitored for GPU, CPU and System Temperatures (logging to a file
every second) - no spikes seen just before crash in Skyrim.
* Removed Creative Labs audio card
* Underclocked CPU, RAM and GPU one at a time
* Disabled Intel speedstep
* Disabled CPU Parking

None of these tests showed any problem whatsoever.  If the above tests
are not enough to test for system stability, then please let me know
what I can test for.

Again, there are NO ISSUES outside of Skyrim.  Skyrim could run for
four hours or 7 minutes before the system reboots.  At one point I had
a save file that would CONSISTENTLY cause this problem, but that file
could not be opened by your support staff and since the latest patch
would no longer crash for me.  This file would load inside of
Dragonsreach and I would turn around and exit the front door and the
system would reboot.  I did this multiple times in a row and then
loaded a different save game and didn't experience the reboot.  I've
also found that unbinding spells from favorites seemed to have caused
the reboots NOT to happen.  This clearly showed it was *something* in
the save file.  Again, I did this systematically: reboot/crash,
reboot/crash, reboot/crash, unbind spells, no reboot/crash, load
again, reboot/crash.  As I said earlier, this save file could not even
be opened by your support staff - it would cause the game to crash to
desktop.

I have zero issues in other games, namely Batman:Arkham City,
Battlefield 3 and Grid.  But I bought this system specifically to play
Skyrim and it's the only game causing me problems.

Others on the Internet have the same symptoms, as it's popped up in
multiple forums.  Clearly it's something that is specific since only a
minority of the millions of players out there have this problem.  I
know of two people personally who have had no issues in Skyrim.  As I
said above, overclocking is something I've done just recently because
all my testing showed that the system is solid - I can't find a
problem with it AT ALL, except for Skyrim crashing.  In other words,
crashing is isolated to Skyrim.

I read an interview with Bethesda's Todd Howard where he had stated
that things you do in the game cause issues, such as the PS3 lag
that's been widely reported.  I can't help but think it's something
like that for my issue.  Now, I do understand about software
development and software issues, so what I'm looking for is this: what
is happening with my install/game that is causing this problem?  If
it's something like "binding spell x and spell y at the same time"
then it's not a problem to avoid that behavior.

I've maintained a blog with some graphs and more detailed information
of the troubleshooting I've done, including a red herring of disabling
Intel Speedstep (it did make it more stable, but I think it only
masked the problem) and disabling Windows 7 Core Parking.

This link shows you all the Skyrim related posts:

http://one-miguel.blogspot.com/search/label/Skyrim

But please, if you believe my system is unstable, let me know what
test to run so I can start to pinpoint the problem on my end, because
I've been trying to blame my system for the past three months but I
can't find a problem.

Thank you.