Scary RAM-gobbling bug in SQL Server 2014 exposed by Visual Studio online outage

Microsoft techies have pinpointed the SQL query that smashed the software giant's Visual Studio Team Services offline for several hours. It appears the query triggered a crippling SQL Server 2014 bug that Redmond is now trying to fix. Last week, Microsoft's cloud-based Team Services suffered a five-hour outage, leaving …

  1. Richard 12 Silver badge

    How do you mess that one up?

    The query explicitly states that it will return one row at most!

    How does a memory optimisation ignore the explicit limits set in a query?

    1. Anonymous Coward
      Anonymous Coward

      Re: How do you mess that one up?

      Because the SQL coding team are crap at maths?

    2. Anonymous Coward
      Anonymous Coward

      Re: How do you mess that one up?

      The query may return one row but that does not mean that intermediate stages consist of one row. If it was constrained to processing only one row at a time in memory, it would probably become extremely slow.

      This problem of wanting to return one row but having to process an awful lot of rows to get it is extremely common, and it frequently causes memory issues. The question for me would be whether the schema or views need changing in some way to avoid the issue.
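      A hypothetical illustration of the point (table and column names invented): the final result is a single row, but the join can materialise a huge intermediate set first.

      -- One row comes out, but the join below may produce millions of
      -- intermediate rows before TOP 1 is ever applied. The optimizer
      -- sizes its memory grant for the sort over that whole intermediate
      -- set, not for the single row at the end.
      SELECT TOP 1 o.OrderId, o.Total
      FROM Orders AS o
      JOIN OrderLines AS l ON l.OrderId = o.OrderId
      WHERE l.Sku = 'ABC123'
      ORDER BY o.Total DESC;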

    3. Dan 55 Silver badge

      Re: How do you mess that one up?

      In my admittedly crap hackish experience with SQL, if you're doing TOP 1 or LIMIT 1 or whatever then you're probably doing it wrong anyway, because you've got a load of different results for the same query and then you throw all but one away. The one you're left with might even be the right answer.

      1. Peter 26

        Re: How do you mess that one up?

        Yes, I agree, it is definitely a warning sign. If there is no way to write the SQL to do what you want without using TOP 1, then you need to have a look at the schema and maybe redesign.

        But I guess they say write first, optimize later if needed, which is probably what happened here.

        1. Anonymous Coward
          Anonymous Coward

          Re: How do you mess that one up?

          "But I guess they say write first, optimize later if needed"

          That's supposed to happen in development, not in large scale customer facing production code!

          But then I have been asked why I needed a full 16Gbytes and a replica of a production database for query testing. Aren't a few thousand rows and a VM enough?

          Answer: No.

      2. Anonymous Coward
        Anonymous Coward

        Re: How do you mess that one up?

        Especially since I can't see an ORDER BY here. Maybe the INDEX hint ensures some kind of retrieval order in SQL Server, but that query looks really ugly. There are also identical predicates in the ON and WHERE clauses.

        1. Anonymous Coward
          Anonymous Coward

          Re: How do you mess that one up?

          "Especially since I can't see an ORDER BY here"

          The query is supposed to return one row, so an ORDER BY on the output is nonfunctional.

          There doesn't seem to be any table where ordering would be likely to short-circuit the process, so adding any ordering will just make the whole query bigger and more expensive.

          1. Anonymous Coward
            Anonymous Coward

            Re: How do you mess that one up?

            A) You're sure your query will *always* return one row only; then TOP is useless - and if for any reason an unexpected result is returned, it is better to see an error/exception than to blindly take the TOP one even if it is the wrong one - leading to data corruption.

            B) Your query returns more than one row, and you need the TOP one - unless you enforce the retrieval order in some way, any change in the way rows are returned will lead to a different result. This can be a particular issue with non-rule-based optimizers, which could still change the access path even where "hints" suggest another. IMHO relying blindly on the "actual implementation" is just too risky.

            1. Philip Lewis

              Re: How do you mess that one up?

              "IMHO relying blindly on the "actual implementation" is just too risky."

              It is not "risky", it is wrong! And when it breaks, the shit will start flying.

              I get so tired of explaining this to morons, one despairs! Just because it "works" does not mean it is semantically or theoretically correct. The amount of "dodgy" code I have fixed in the SQL Server environment, where idiots simply assume order because there is an index that just happens to get chosen by the optimizer, is very large indeed.

              Without an ORDER BY, there is NO ORDER in the returned results, even if it looks like there is at first blush. Learning some relational theory would be a good thing for a very large percentage of the world's programmers and, sadly, a lot of DBAs :(

          2. Anonymous Coward
            Anonymous Coward

            Re: How do you mess that one up?

            Except in SQL Server, the ORDER BY is also used to determine which rows to include in TOP. In this case, it wouldn't be used for presentation sorting, it would be used to guarantee predictable results (e.g. always the *same* TOP 1, if the data hasn't changed).
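            A minimal sketch of the difference (column name assumed from the index hint quoted elsewhere in the thread):

            -- Without ORDER BY, "TOP 1" is whichever row the chosen plan happens
            -- to reach first; a different index or plan can return a different row.
            SELECT TOP 1 AccountName FROM tbl_Identity;

            -- With ORDER BY, the TOP row is well defined whatever plan is chosen.
            SELECT TOP 1 AccountName FROM tbl_Identity ORDER BY AccountName;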

    4. mfwiniberg

      Re: How do you mess that one up?

      Some time ago I came across a 'feature' of SQL 2014 where SELECT TOP 1 returned two or more rows. If SELECT TOP 1 doesn't actually return one row, wouldn't that cause the cardinality estimation to fail?

      Mike

    5. KarateMonkey

      Re: How do you mess that one up?

      "The query explicitly states that it will return one row at most!"

      FAIL.

      Although the aggregate function will _return_ one result, the dataserver has to first produce a potentially much larger result set which is then fed through the aggregate function to derive that result.

      The optimiser has to deal with the entire query plan, not just that final aggregation step.
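      A minimal sketch of that point (names invented): the aggregate returns one row, but everything feeding it still has to be processed.

      -- One row out, but the join's intermediate set going in can be enormous,
      -- and it is that set the optimizer must plan (and grant memory) for.
      SELECT MAX(t.Amount)
      FROM BigTable AS t
      JOIN OtherBigTable AS o ON o.KeyCol = t.KeyCol;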

  2. Hans 1
    Boffin

    >if you're running SQL Server 2014 with big tables,

    you should consider other database vendors, like Oracle, Sybase, or Postgres.

    FTFY

    1. Adam 52 Silver badge

      Not really, you'll have the same problem on Postgres at least. Some of the new DBs are really awful - Amazon Redshift, for example, dies (as in full cluster reboot, come back in 15 mins) like this on anything with vaguely inaccurate statistics; the various Hadoop-based SQL engines just abort, and MySQL just can't do big tables at all.

      Oracle and SQL Server have amazing query optimisers. SQL Server often has these issues in new releases and they get patched; 2005 was awful at the beginning but turned into a very stable database after SP1. I'm not a big Oracle user, but I imagine they have similar issues.

      1. Anonymous Coward
        Anonymous Coward

        MySQL Is Better Than It Was

        We've got a MySQL DB running on AWS and one of the frequently accessed (as in joined-to) tables currently has 220,842,233 rows. Performance is absolutely fine. Of course, good schema and index design comes into play in situations like this, so YMMV.

      2. h4rm0ny

        Postgres actually doesn't allow query optimization hints. It's one of the things I like about it:

        http://it.toolbox.com/blogs/database-soup/why-postgresql-doesnt-have-query-hints-44121

        Oracle I have never liked. IMO, there are two databases that should be used outside of special cases, and these are Postgres and SQL Server.

    2. Hans 1
      Coffee/keyboard

      Window cleaner and Surface Expert brigade are apparently in full swing this morning, keep 'em coming, the downvotes.

      Everybody on here knows that MS software cannot scale. You will notice that they "fixed" the issue by using a hint that allocates 1% of the 4GB their "optimizer" (ROFL) thought was required, for one row. Serious fail, if you ask me ... the fix is not good enough for me; I'm pretty sure 0.002% would suffice.

      Don't ever let window cleaners and Surface Experts near your RDBMS, I tell ya!

      Now, fanboyz, anything to comment ?

    3. Adam 1

      So your first suggestion to avoid buggy software is to use something by Oracle?

      1. Anonymous Coward
        Anonymous Coward

        And Oracle makes you pay to get bug fixes...

      2. Anonymous Coward
        Anonymous Coward

        Actually, one of our clients is a mega bank - you know, a bank that comes from some European country well known for chocolate, cheese, precision chronographs (or whatever they call watches this week), and banks. They happen to use our software to handle financial transactions. The data is in Oracle Database clusters, and the DBs are actually "quite" large - the SQL Server database that fell over here is nowhere near that size; seriously, not even in the same universe. And the thing puffs along quite happily, not a database performance issue in years, AFAIK - and they would knock on my door if anything went wrong.

        There is no way in hell you could run that off SQL Server, they would laugh you out of the building.

    4. Anonymous Coward
      Anonymous Coward

      SQL Server has, and always has had, a brain-dead optimizer - especially with non-trivial table cardinalities. I did not notice very many instances where this improved in SQL 2014, though I suspect there were improvements.

      The bottom line is that big tables pose problems, as the optimizer makes assumptions that go seriously wrong. This is partially due to "statistics drift": automatically updated statistics, aren't, thanks to the hard-coded "black box" algorithm. The problem gets out of hand when large numbers of physical partitions are involved.

      There are many simple cases that optimize so poorly that it may choose to scan the table in obviously wrong situations. A simple select of a row using the primary key can sometimes result in a table scan - always wrong in a table with more than a few pages' worth of rows.
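      For what it's worth, the usual first aid for statistics drift is a manual refresh; a minimal sketch (table and index names invented):

      -- Rebuild the statistics from every row rather than a sample,
      -- then inspect the histogram the optimizer will actually use.
      UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;
      DBCC SHOW_STATISTICS ('dbo.BigTable', 'PK_BigTable');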

  3. A Non e-mouse Silver badge

    From the blog posting:

    You can’t, can’t, can’t just “hit the reset button” and say “that fixed it”

    This, from a Microsoft employee? He must be new to working with Microsoft software...

    1. Hans 1
      Coffee/keyboard

      Well, what he means is that they cannot just reset "THIS TIME", because it falls over every three or four seconds ... and Windows does not have systemd, so the time required to reboot is disproportionate.

  4. Anonymous Coward
    Anonymous Coward

    > "How does a memory optimisation ignore the explicit limits set in a query?"

    I'm assuming it's assuming there's an intermediate result set with more than one record before it extracts the first record, thus granting it more memory...?

    1. Adam 1

      Maybe. The important bit is that the cardinality estimator thinks there will be a lot of rows, and so (tries to) allocate memory for them.
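      You can actually watch this happening: SQL Server exposes in-flight memory grants through a DMV.

      -- Requested vs granted vs actually used memory, per running query:
      SELECT session_id, requested_memory_kb, granted_memory_kb, used_memory_kb
      FROM sys.dm_exec_query_memory_grants;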

  5. Duncan Macdonald

    Simple temporary fix

    Just limit the maximum size returned by the predictor to 100MB (or a per-database defined limit) - this will slow down some truly large queries but will stop the system from crashing. For a more permanent fix, add an option HUGE_QUERY that overrides the limit for the few queries that require a massive amount of RAM.

    These fixes would only require a tiny amount of additional code and no changes to the body of the predictor, which would reduce the chance of new bugs being introduced as a result of the fix.
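    As an aside, SQL Server's Resource Governor can already approximate this kind of cap at the workload level; a minimal sketch (pool and group names invented, classifier function to route sessions not shown):

    -- Cap memory grants for sessions routed into this workload group.
    CREATE RESOURCE POOL CappedPool WITH (MAX_MEMORY_PERCENT = 25);
    CREATE WORKLOAD GROUP CappedQueries
        WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 5)
        USING CappedPool;
    ALTER RESOURCE GOVERNOR RECONFIGURE;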

    1. Roq D. Kasba

      Re: Simple temporary fix

      I'd resist that route - you'll just kick the problem down the road with some new, unthought-of symptoms.

      1. Anonymous Coward
        Anonymous Coward

        Re: Simple temporary fix

        Yes - for every problem there is an answer that is simple, obvious and wrong. For big data problems that applies on a massive scale.

    2. Adam 1

      Re: Simple temporary fix

      The simple problem is that customer A doesn't want their queries slowed because of a problem they don't face with their usage pattern. You don't need to be very imaginative to picture big-data analytics run by a small team of researchers for whom allocating tens of GB per user is an average day and no problem. They would be hit for six by such a change.

      It isn't just the slowdown either. SQL Server uses locking concurrency control* (usually). Even very small decreases in performance can exponentially increase problems with deadlocks.

      * as opposed to MVCC / MGA etc
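      (For completeness: SQL Server can opt into MVCC-style reads per database, at the cost of tempdb version-store overhead; a minimal sketch, database name invented.)

      -- Readers stop blocking writers once row versioning is on:
      ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON;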

  6. Sil

    Surely this is no consolation for the victims of the two incidents, but you have to admire the transparency and level of detail of Brian Harry's two blog posts detailing the incidents, their investigation and resolution, and the lessons learned.

    Highly recommended read.

    1. Adam 1

      Actually, knowing your vendor will be upfront and wear their mea culpa in public rather than hiding behind legalese (cough, VTech) means there is a culture of learning from your mistakes rather than shoving them under the blanket or finding some rogue engineers to blame (cough, VW).

      Now here is where I should feel smug for hosting our SVN internally, but I suspect that any outage we suffer will be quickly blamed on some SAN or switch or someone else.

      1. BagOfSpanners

        Yes, it's good that Microsoft have published the technical details and given us the opportunity to discuss and learn from this problem.

  7. arctic_haze

    Yes, but why pizza?

    Have I missed a joke, here?

    1. Paul Crawford Silver badge

      Re: Yes, but why pizza?

      It's what you feed software people when they have to work late to fix something, along with some coffee.

      Whereas the rest of the time they just eat pizza.

      1. h4rm0ny

        Re: Yes, but why pizza?

        Surely it would therefore be more time-efficient to simply invent caffeinated pizza?

        Hmmm, you heard it here first, people!

  8. Anonymous Coward
    Anonymous Coward

    interesting

    Can I get out of work if our DB servers die... /hide

  9. Anonymous Coward
    Trollface

    The facts!

    1KB for the data buffer, 3.5GB for the slurping buffer.

  10. bharry_msft

    Query optimizers

    I've been working around query optimizers for a long time and one of the things I've learned is that they are rocket science. They are incredibly difficult to get right, and a small change that improves one query can affect others negatively. That's part of why SQL Server gives you the ability to control (through trace flags) which query optimizer compat level you are using. You should never change the compat level on a large-scale database without testing.

    As I said in my post, the responsibility for this outage is mine, not SQL's. Yes, their new cardinality estimator misestimated this query. They will tune it to fix this query while keeping the millions of others working well. This is the business of query optimization. I view part of my job, as a large-scale Microsoft service, as providing a great test case for our platform services.

    As some of you have surmised, the issue is with intermediates, not with the result. So the TOP 1 alone isn't enough to tell the estimator the answer - I should have mentioned that in my post.

    I truly believe there's no incompetence on the SQL Server side - and I'm not just saying that because I work at Microsoft. Those folks do some amazing things and I'm grateful for it.

    Brian Harry
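    P.S. For anyone who wants to compare estimators themselves, a minimal sketch (table, column and value invented; trace flag 9481 selects the legacy, pre-2014 cardinality estimator):

    -- Run one query under the old CE without changing the database compat level.
    SELECT TOP 1 AccountName
    FROM dbo.tbl_Identity
    WHERE TypeId = 42
    OPTION (QUERYTRACEON 9481);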

    1. Anonymous Coward
      Anonymous Coward

      Re: Query optimizers

      Sir, I salute you.

      1. h4rm0ny
        Pint

        Re: Query optimizers

        I just want to say that the openness Microsoft has shown in explaining this problem - how it occurred, how it was addressed - has scored major points with me. As someone else said up-thread, it shows a culture of learning from mistakes, and I see that as an extremely positive thing in a company. After all, everyone makes mistakes - that's mandatory. Learning from / being open about them - that's optional.

        1. Philip Lewis

          Re: Query optimizers

          You have clearly never investigated how many broken optimisations the optimizer comes up with, including many that simply give the wrong results.

          Microsoft just ignores them, and has been ignoring them for quite some time. I expect they will continue to ignore them.

          In essence, they don't care, because they simply do not understand how critical a "wrong answer" optimization is. Oracle, on the other hand, is usually pretty snappy at fixing "wrong result" types of error.

          Much as I do not like classic Oracle (I love OracleRdb of course), they at least understand data, and they can deal with very large databases far better than most imagine in their "red mist" dislike of LE.

  11. Spam Junkie
    WTF?

    Stop bullying the optimizer

    Sometimes the estimator can get things wrong. Often it's because statistics are out of date or missing and it has no way to gather a correct estimate. Usually memory grant problems go the other way and the issue is that not enough memory is granted.

    The person who wrote (or "fixed") this particular query has taken matters into his or her own hands by including many hints which force the optimiser to take a specific approach when generating a plan:

    INNER LOOP JOIN

    WITH (INDEX=IX_tbl_Identity_AccountName_TypeId, FORCESEEK, UPDLOCK, HOLDLOCK)

    OPTION (OPTIMIZE FOR (@partitionId UNKNOWN))

    Using query hints to boss the optimizer around is widely considered fraught with danger. The use of these hints in the first place makes it clear that there were performance issues with this query before this problem cropped up. The solution might be as simple as adding an appropriate covering index (a hypothetical sketch at the end of this comment). It's also difficult to guess what's going on with the temp table #identities, so we're missing the full picture. The use of MAX_GRANT_PERCENT = 1 seems like just another nasty hack and not the final solution.
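    For the curious, the grant cap is a query-level hint; a rough sketch (query shape invented, only the OPTION clause is the reported fix):

    SELECT TOP 1 IdentityId
    FROM dbo.tbl_Identity
    WHERE AccountName = N'someUser'
    OPTION (MAX_GRANT_PERCENT = 1);  -- cap this query's memory grant at 1% of the maximum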

    The presence of these hints should raise a red flag during code review, and they should definitely be revisited before upgrading to a new optimizer.

    IMHO this should be considered a quality process fail for the VSTS team rather than an SQL Server problem.
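    And the hypothetical covering index mentioned above (column choices invented; the real ones would depend on the query's predicates and output):

    -- Serves AccountName/TypeId lookups entirely from the index, avoiding key lookups:
    CREATE NONCLUSTERED INDEX IX_tbl_Identity_AccountName_TypeId_Covering
    ON dbo.tbl_Identity (AccountName, TypeId)
    INCLUDE (IdentityId);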

    1. Philip Lewis

      Re: Stop bullying the optimizer

      I agree. The code expends more effort bending the optimizer than returning the result. As a DBA at my last client, I spent an inordinate amount of time unwinding this sort of crap (and much worse).

      There is a reason that data people who know how databases *actually* work and how to design them logically and physically get paid well - we are few and far between, and the number of "programmer designed" databases seems to be growing exponentially.

  12. Basil Fernie
    FAIL

    Same song, different lyrics

    SQL wants 4GB of RAM? Tut, tut. No ambition.

    I've just farewelled Opera which, in its long-term version 12.xxx for Linux, would after a period open on the desktop, stealthily take over just about all the RAM available, then make a mighty asset-grab and lay claim to 16777216 TERABYTES of virtual memory, freezing everything else for an undefined period of time (at least days, but I couldn't spare my laptop from active service for a few centuries while waiting to see if it would eventually blink).

    Footnote: Opera continued its development path on Windows and eventually picked up Linux again at about version 34. Unfortunately this is missing the integrated e-mailer, which is what kept me faithful for many years, so I'm now free to mix-and-match browsers and e-mailers like all the poor unwashed who never knew the benefits of great design. The only thing that keeps me away from Chrome is Google.

  13. Hans 1
    Mushroom

    MS SQL is the perfect DB for the enterprise

    Hey, you, yes YOU, listen ... if you are talking enterprise DBs, as in big DBs with many, many rows, and you don't know Oracle or DB2, your opinion does not count, you understand? Go back to your Crayola!
