Scary RAM-gobbling bug in SQL Server 2014 exposed by Visual Studio online outage

Microsoft techies have pinpointed the SQL query that smashed the software giant's Visual Studio Team Services offline for several hours. It appears the query triggered a crippling SQL Server 2014 bug that Redmond is now trying to fix. Last week, Microsoft's cloud-based Team Services suffered a five-hour outage, leaving …

  1. Richard 12 Silver badge

    How do you mess that one up?

    The query explicitly states that it will return one row at most!

    How does a memory optimisation ignore the explicit limits set in a query?

    1. Anonymous Coward
      Anonymous Coward

      Re: How do you mess that one up?

      Because the SQL coding team are crap at maths?

    2. Anonymous Coward
      Anonymous Coward

      Re: How do you mess that one up?

      The query may return one row but that does not mean that intermediate stages consist of one row. If it was constrained to processing only one row at a time in memory, it would probably become extremely slow.

      This problem of wanting to return one row but having to process an awful lot of rows to get it is extremely common, and it frequently causes memory issues. The question for me would be whether the schema or views need changing in some way to avoid the issue.
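      A hypothetical illustration of the point (table and column names invented): the final result is a single row, but the join can materialise a huge intermediate set first.

      -- One row comes out, but the join below may produce millions of
      -- intermediate rows before TOP 1 is ever applied. The optimizer
      -- sizes its memory grant for the sort over that whole intermediate
      -- set, not for the single row at the end.
      SELECT TOP 1 o.OrderId, o.Total
      FROM Orders AS o
      JOIN OrderLines AS l ON l.OrderId = o.OrderId
      WHERE l.Sku = 'ABC123'
      ORDER BY o.Total DESC;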

    3. Dan 55 Silver badge

      Re: How do you mess that one up?

      In my admittedly crap hackish experience with SQL, if you're doing TOP 1 or LIMIT 1 or whatever then you're probably doing it wrong anyway, because you've got a load of different results for the same query and then you throw all but one away. The one you're left with might even be the right answer.

      1. Peter 26

        Re: How do you mess that one up?

        Yes, I agree, it is definitely a warning sign. If there is no way to write the SQL to do what you want without using TOP 1, then you need to have a look at the schema and maybe redesign.

        But I guess they say write first, optimize later if needed, which is probably what happened here.

        1. Anonymous Coward
          Anonymous Coward

          Re: How do you mess that one up?

          "But I guess they say write first, optimize later if needed"

          That's supposed to happen in development, not in large scale customer facing production code!

          But then I have been asked why I needed a full 16Gbytes and a replica of a production database for query testing. Aren't a few thousand rows and a VM enough?

          Answer: No.

      2. Anonymous Coward
        Anonymous Coward

        Re: How do you mess that one up?

        Especially since I can't see an ORDER BY here. Maybe the INDEX hint ensures some kind of retrieval order in SQL Server, but that query looks really ugly. There are also identical predicates in the ON and WHERE clauses.

        1. Anonymous Coward
          Anonymous Coward

          Re: How do you mess that one up?

          "Especially since I can't see an ORDER BY here"

          The query is supposed to return one row, so an ORDER BY on the output is nonfunctional.

          There doesn't seem to be any table where ordering would be likely to short-circuit the process, so adding any ordering will just make the whole query bigger and more expensive.

          1. Anonymous Coward
            Anonymous Coward

            Re: How do you mess that one up?

            A) You're sure your query will *always* return one row only; then TOP is useless - and if for any reason an unexpected result is returned, it is better to see an error/exception than to blindly take the TOP one even if it is the wrong one - leading to data corruption.

            B) Your query returns more than one row, and you need the TOP one - unless you enforce the retrieval order in some way, any change in the way rows are returned will lead to a different result. This can be a particular issue with non-rule-based optimizers, which could still change the access path even where "hints" suggest another. IMHO relying blindly on the "actual implementation" is just too risky.

            1. Philip Lewis

              Re: How do you mess that one up?

              "IMHO relying blindly on the "actual implementation" is just too risky."

              It is not "risky", it is wrong! And when it breaks, the shit will start flying.

              I get so tired of explaining this to morons, one despairs! Just because it "works" does not mean it is semantically or theoretically correct. The amount of "dodgy" code I have fixed in the SQL Server environment, where idiots simply assume order because there is an index that just happens to get chosen by the optimizer, is very large indeed.

              Without an ORDER BY, there is NO ORDER in the returned results, even if it looks like there is at first blush. Learning some relational theory would be a good thing for a very large percentage of the world's programmers and, sadly, a lot of DBAs :(

          2. Anonymous Coward
            Anonymous Coward

            Re: How do you mess that one up?

            Except in SQL Server, the ORDER BY is also used to determine which rows to include in TOP. In this case, it wouldn't be used for presentation sorting, it would be used to guarantee predictable results (e.g. always the *same* TOP 1, if the data hasn't changed).
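            A minimal sketch of the difference (column name assumed from the index hint quoted elsewhere in the thread):

            -- Without ORDER BY, "TOP 1" is whichever row the chosen plan happens
            -- to reach first; a different index or plan can return a different row.
            SELECT TOP 1 AccountName FROM tbl_Identity;

            -- With ORDER BY, the TOP row is well defined whatever plan is chosen.
            SELECT TOP 1 AccountName FROM tbl_Identity ORDER BY AccountName;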

    4. mfwiniberg

      Re: How do you mess that one up?

      Some time ago I came across a 'feature' of SQL 2014 where SELECT TOP 1 returned two or more rows. If SELECT TOP 1 doesn't actually return one row, wouldn't that cause the cardinality estimation to fail?

      Mike

    5. KarateMonkey

      Re: How do you mess that one up?

      "The query explicitly states that it will return one row at most!"

      FAIL.

      Although the aggregate function will _return_ one result, the dataserver has to first produce a potentially much larger result set which is then fed through the aggregate function to derive that result.

      The optimiser has to deal with the entire query plan, not just that final aggregation step.
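      A minimal sketch of that point (names invented): the aggregate returns one row, but everything feeding it still has to be processed.

      -- One row out, but the join's intermediate set going in can be enormous,
      -- and it is that set the optimizer must plan (and grant memory) for.
      SELECT MAX(t.Amount)
      FROM BigTable AS t
      JOIN OtherBigTable AS o ON o.KeyCol = t.KeyCol;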

  2. Hans 1
    Boffin

    >if you're running SQL Server 2014 with big tables,

    you should consider other database vendors, like Oracle, Sybase, or Postgres.

    FTFY

    1. Adam 52 Silver badge

      Not really, you'll have the same problem on Postgres at least. Some of the new DBs are really awful - Amazon Redshift, for example, dies (as in full cluster reboot, come back in 15 mins) like this on anything with vaguely inaccurate statistics; the various Hadoop-based SQL engines just abort, and MySQL just can't do big tables at all.

      Oracle and SQL Server have amazing query optimisers. SQL Server often has these issues in new releases and they get patched; 2005 was awful at the beginning but turned into a very stable database after SP1. I'm not a big Oracle user, but I imagine they have similar issues.

      1. Anonymous Coward
        Anonymous Coward

        MySQL Is Better Than It Was

        We've got a MySQL DB running on AWS and one of the frequently accessed (as in joined-to) tables currently has 220,842,233 rows. Performance is absolutely fine. Of course, good schema and index design comes into play in situations like this, so YMMV.

      2. h4rm0ny

        Postgres actually doesn't allow query optimization hints. It's one of the things I like about it:

        http://it.toolbox.com/blogs/database-soup/why-postgresql-doesnt-have-query-hints-44121

        Oracle I have never liked. IMO, there are two databases that should be used outside of special cases, and these are Postgres and SQL Server.

    2. Hans 1
      Coffee/keyboard

      Window cleaner and Surface Expert brigade are apparently in full swing this morning, keep 'em coming, the downvotes.

      Everybody on here knows that MS software cannot scale. You will notice that they "fixed" the issue by using a hint that allocates 1% of the 4GB their "optimizer" (ROFL) thought was required, for one row. Serious fail, if you ask me ... the fix is not good enough for me; I'm pretty sure 0.002% would suffice.

      Don't ever let window cleaners and Surface Experts near your RDBMS, I tell ya!

      Now, fanboyz, anything to comment ?

    3. Adam 1

      So your first suggestion to avoid buggy software is to use something by Oracle?

      1. Anonymous Coward
        Anonymous Coward

        And Oracle makes you pay to get bug fixes...

      2. Anonymous Coward
        Anonymous Coward

        Actually, one of our clients is a mega bank - you know, a bank that comes from some European country well known for chocolate, cheese, precision chronographs (or whatever they call watches this week), and banks. They happen to use our software to handle financial transactions. The data is in Oracle Database clusters, and the DBs are actually "quite" large - the SQL Server database that fell over here is nowhere near that size; seriously, not even in the same universe. And the thing puffs along quite happily, not a database performance issue in years, AFAIK - and they would knock on my door if anything went wrong.

        There is no way in hell you could run that off SQL Server, they would laugh you out of the building.

    4. Anonymous Coward
      Anonymous Coward

      SQL Server has, and always has had, a brain-dead optimizer - especially with non-trivial table cardinalities. I did not notice very many instances where this improved in SQL 2014, though I suspect there were improvements.

      The bottom line is that big tables pose problems, as the optimizer makes assumptions that go seriously wrong. This is partially due to "statistics drift": automatically updated statistics, aren't, thanks to the hard-coded "black box" algorithm. The problem gets out of hand when large numbers of physical partitions are involved.

      There are many simple cases that optimize so poorly that it may choose to scan the table in obviously wrong situations. A simple select of a row using the primary key can sometimes result in a table scan - always wrong in a table with more than a few pages' worth of rows.
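      For what it's worth, the usual first aid for statistics drift is a manual refresh; a minimal sketch (table and index names invented):

      -- Rebuild the statistics from every row rather than a sample,
      -- then inspect the histogram the optimizer will actually use.
      UPDATE STATISTICS dbo.BigTable WITH FULLSCAN;
      DBCC SHOW_STATISTICS ('dbo.BigTable', 'PK_BigTable');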

  3. A Non e-mouse Silver badge

    From the blog posting:

    You can’t, can’t, can’t just “hit the reset button” and say “that fixed it”

    This, from a Microsoft employee? He must be new to working with Microsoft software...

    1. Hans 1
      Coffee/keyboard

      Well, what he means is that they cannot just reset "THIS TIME", because it falls over every three or four seconds ... and Windows does not have systemd, so the time required to reboot is disproportionate.

  4. Anonymous Coward
    Anonymous Coward

    > "How does a memory optimisation ignore the explicit limits set in a query?"

    I'm assuming it's assuming there's an intermediate result set with more than one record before it extracts the first record, thus granting it more memory...?

    1. Adam 1

      Maybe. The important bit is that the cardinality estimator thinks there will be a lot of rows, and so (tries to) allocate memory for them.
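      You can actually watch this happening: SQL Server exposes in-flight memory grants through a DMV.

      -- Requested vs granted vs actually used memory, per running query:
      SELECT session_id, requested_memory_kb, granted_memory_kb, used_memory_kb
      FROM sys.dm_exec_query_memory_grants;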

  5. Duncan Macdonald

    Simple temporary fix

    Just limit the maximum size returned by the predictor to 100MB (or a per-database defined limit) - this will slow down some truly large queries but will stop the system from crashing. For a more permanent fix, add an option HUGE_QUERY that overrides the limit for the few queries that require a massive amount of RAM.

    These fixes would only require a tiny amount of additional code and no changes to the body of the predictor, which would reduce the chance of new bugs being introduced as a result of the fix.
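    As an aside, SQL Server's Resource Governor can already approximate this kind of cap at the workload level; a minimal sketch (pool and group names invented, classifier function to route sessions not shown):

    -- Cap memory grants for sessions routed into this workload group.
    CREATE RESOURCE POOL CappedPool WITH (MAX_MEMORY_PERCENT = 25);
    CREATE WORKLOAD GROUP CappedQueries
        WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 5)
        USING CappedPool;
    ALTER RESOURCE GOVERNOR RECONFIGURE;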

    1. Roq D. Kasba

      Re: Simple temporary fix

      I'd resist that route - you'll just kick the problem down the road with some new, unthought-of symptoms.

      1. Anonymous Coward
        Anonymous Coward

        Re: Simple temporary fix

        Yes - for every problem there is an answer that is simple, obvious and wrong. For big data problems that applies on a massive scale.

    2. Adam 1

      Re: Simple temporary fix

      The simple problem is that customer A doesn't want their queries slowed because of a problem they don't face with their usage pattern. You don't need to be very imaginative to picture big-data analytics run by a small team of researchers for whom allocating tens of GB per user is an average day and no problem. They would be hit for six by such a change.

      It isn't just the slowdown either. SQL Server uses locking concurrency control* (usually). Even very small decreases in performance can exponentially increase problems with deadlocks.

      * as opposed to MVCC / MGA etc
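      (For completeness: SQL Server can opt into MVCC-style reads per database, at the cost of tempdb version-store overhead; a minimal sketch, database name invented.)

      -- Readers stop blocking writers once row versioning is on:
      ALTER DATABASE MyDb SET READ_COMMITTED_SNAPSHOT ON;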

  6. Sil

    Surely this is no consolation for the victims of the two incidents, but you have to admire the transparency and level of detail of Brian Harry's two blog posts detailing the incidents, their investigation and resolution, and the lessons learned.

    Highly recommended read.

    1. Adam 1

      Actually, knowing your vendor will be upfront and wear their mea culpa in public rather than hiding behind legalese (cough, VTech) means there is a culture of learning from your mistakes rather than shoving them under the blanket or finding some rogue engineers to blame (cough, VW).

      Now here is where I should feel smug for hosting our SVN internally, but I suspect that any outage we suffer will be quickly blamed on some SAN or switch or someone else.

      1. BagOfSpanners

        Yes, it's good that Microsoft have published the technical details and given us the opportunity to discuss and learn from this problem.

  7. arctic_haze

    Yes, but why pizza?

    Have I missed a joke, here?

    1. Paul Crawford Silver badge

      Re: Yes, but why pizza?

      It's what you feed software people when they have to work late to fix something, along with some coffee.

      Whereas the rest of the time they just eat pizza.

      1. h4rm0ny

        Re: Yes, but why pizza?

        Surely it would therefore be more time-efficient to simply invent caffeinated pizza?

        Hmmm, you heard it here first, people!

  8. Anonymous Coward
    Anonymous Coward

    interesting

    Can I get out of work if our DB servers die... /hide

  9. Anonymous Coward
    Trollface

    The facts!

    1KB for the data buffer, 3.5GB for the slurping buffer.

  10. bharry_msft

    Query optimizers

    I've been working around query optimizers for a long time and one of the things I've learned is that they are rocket science. They are incredibly difficult to get right, and a small change that improves one query can affect others negatively. That's part of why SQL Server gives you the ability to control (through trace flags) which query optimizer compat level you are using. You should never change the compat level on a large-scale database without testing.

    As I said in my post, the responsibility for this outage is mine, not SQL's. Yes, their new cardinality estimator misestimated this query. They will tune it to fix this query while keeping the millions of others working well. This is the business of query optimization. I view part of my job, as a large-scale Microsoft service, as providing a great test case for our platform services.

    As some of you have surmised, the issue is with intermediates, not with the result. So the TOP 1 alone isn't enough to tell the estimator the answer - I should have mentioned that in my post.

    I truly believe there's no incompetence on the SQL Server side - and I'm not just saying that because I work at Microsoft. Those folks do some amazing things and I'm grateful for it.

    Brian Harry
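    P.S. For anyone who wants to compare estimators themselves, a minimal sketch (table, column and value invented; trace flag 9481 selects the legacy, pre-2014 cardinality estimator):

    -- Run one query under the old CE without changing the database compat level.
    SELECT TOP 1 AccountName
    FROM dbo.tbl_Identity
    WHERE TypeId = 42
    OPTION (QUERYTRACEON 9481);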

    1. Anonymous Coward
      Anonymous Coward

      Re: Query optimizers

      Sir, I salute you.

      1. h4rm0ny
        Pint

        Re: Query optimizers

        I just want to say that the openness Microsoft has shown in explaining this problem - how it occurred, how it was addressed - has scored major points with me. As someone else said up-thread, it shows a culture of learning from mistakes, and I see that as an extremely positive thing in a company. After all, everyone makes mistakes - that's mandatory. Learning from / being open about them - that's optional.

        1. Philip Lewis

          Re: Query optimizers

          You have clearly never investigated how many broken optimisations the optimizer comes up with, including many that simply give the wrong results.

          Microsoft just ignores them, and has been ignoring them for quite some time. I expect they will continue to ignore them.

          In essence, they don't care, because they simply do not understand how critical a "wrong answer" optimization is. Oracle, on the other hand, is usually pretty snappy at fixing "wrong result" types of error.

          Much as I do not like classic Oracle (I love OracleRdb of course), they at least understand data, and they can deal with very large databases far better than most imagine in their "red mist" dislike of LE.

  11. Spam Junkie
    WTF?

    Stop bullying the optimizer

    Sometimes the estimator can get things wrong. Often it's because statistics are out of date or missing and it has no way to gather a correct estimate. Usually memory grant problems go the other way and the issue is that not enough memory is granted.

    The person who wrote (or "fixed") this particular query has taken matters into his or her own hands by including many hints which force the optimiser to take a specific approach when generating a plan:

    INNER LOOP JOIN

    WITH (INDEX=IX_tbl_Identity_AccountName_TypeId, FORCESEEK, UPDLOCK, HOLDLOCK)

    OPTION (OPTIMIZE FOR (@partitionId UNKNOWN))

    Using query hints to boss the optimizer around is widely considered fraught with danger. The use of these hints in the first place makes it clear that there were performance issues with this query before this problem cropped up. The solution might be as simple as adding an appropriate covering index (a hypothetical sketch at the end of this comment). It's also difficult to guess what's going on with the temp table #identities, so we're missing the full picture. The use of MAX_GRANT_PERCENT = 1 seems like just another nasty hack and not the final solution.
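    For the curious, the grant cap is a query-level hint; a rough sketch (query shape invented, only the OPTION clause is the reported fix):

    SELECT TOP 1 IdentityId
    FROM dbo.tbl_Identity
    WHERE AccountName = N'someUser'
    OPTION (MAX_GRANT_PERCENT = 1);  -- cap this query's memory grant at 1% of the maximum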

    The presence of these hints should raise a red flag during code review, and they should definitely be revisited before upgrading to a new optimizer.

    IMHO this should be considered a quality process fail for the VSTS team rather than an SQL Server problem.
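    And the hypothetical covering index mentioned above (column choices invented; the real ones would depend on the query's predicates and output):

    -- Serves AccountName/TypeId lookups entirely from the index, avoiding key lookups:
    CREATE NONCLUSTERED INDEX IX_tbl_Identity_AccountName_TypeId_Covering
    ON dbo.tbl_Identity (AccountName, TypeId)
    INCLUDE (IdentityId);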

    1. Philip Lewis

      Re: Stop bullying the optimizer

      I agree. The code expends more effort bending the optimizer than returning the result. As a DBA at my last client, I spent an inordinate amount of time unwinding this sort of crap (and much worse).

      There is a reason that data people who know how databases *actually* work and how to design them logically and physically get paid well - we are few and far between, and the number of "programmer designed" databases seems to be growing exponentially.

  12. Basil Fernie
    FAIL

    Same song, different lyrics

    SQL wants 4GB of RAM? Tut, tut. No ambition.

    I've just farewelled Opera which, in its long-term version 12.xxx for Linux, would after a period open on the desktop, stealthily take over just about all the RAM available, then make a mighty asset-grab and lay claim to 16777216 TERABYTES of virtual memory, freezing everything else for an undefined period of time (at least days, but I couldn't spare my laptop from active service for a few centuries while waiting to see if it would eventually blink).

    Footnote: Opera continued its development path on Windows and eventually picked up Linux again at about version 34. Unfortunately this is missing the integrated e-mailer, which is what kept me faithful for many years, so I'm now free to mix-and-match browsers and e-mailers like all the poor unwashed who never knew the benefits of great design. The only thing that keeps me away from Chrome is Google.

  13. Hans 1
    Mushroom

    MS SQL is the perfect DB for the enterprise

    Hey, you, yes YOU, listen ... if you are talking enterprise DBs, as in big DBs with many, many rows, and you don't know Oracle or DB2, your opinion does not count, you understand? Go back to your Crayola!
