Silence on the Wire
A good book I read a while back was Silence on the Wire, about all of the data you could glean from a network just by listening. If someone wants to analyze your traffic, there's actually a lot that inadvertently leaks out.
An infosec educator from the United States Military Academy at West Point has taken a look at Netflix's HTTPS implementation, and reckons that all he needs in order to work out which programs you like is a bit of passive traffic capture. The problem, writes Michael Kranch (with collaborator Andrew Reed), is that information in TCP/IP headers …
It's been my opinion for a long time that relying on HTTPS alone for real security is like boiling water in a paper bag - yes, it can be done, but most of the time it's a fail.
I'd be very surprised if this attack is limited to Netflix, and in any case traffic analysis can tell you a lot even if you don't break or work around SSL - which I suspect is not a barrier to nation-state access. Tin foil hat? Don't bother, they don't work either.
And half can be identified in under 2 minutes. It may even tell them something about you if you watch only the first minute or so of a certain number of unidentified films or shows, or maybe the ratio between partially and fully watched titles (and they know which ones you fully watched, so might infer what you partially watched), or maybe the number you started then rejected. It's all data that may be useful in identifying personality traits or other factors, especially over the longer term.
Of course, much of that is only useful if they can identify the user, but that can probably be inferred long-term even on a set-top box from the patterns of which shows are watched and when. E.g. my wife loves crime and court dramas, I love SF, but we also have overlap in both genres that we watch together (as well as other genres, of course). That data could, with a decent degree of accuracy, determine whether my wife, I, or both of us are watching TV at any chosen point, even though we generally don't "sign in" uniquely or use profiles.
It goes without saying that if you can monitor the traffic flow and have something to compare it against, you can match stuff up. It's just that they spent the time to do it a specific way. It's not really rocket science. Just proving something we already knew but never spent the effort to test.
It's not dissimilar to the BBC's ultra-high-tech method of knowing you are watching EastEnders in winter: they watch it on a little screen and match up the flashes around your curtains on scene changes. Watching data packet types based on changing data rates for light/dark/action scenes (or, with VBR music, bass/treble/melody/timing) is just a digital way of doing the same.
If some services change the bitrate depending on congestion or saturation of an individual line, it would be interesting to see if this technique still worked. Also, does running two Netflix streams behind a firewall or NAT make the detection harder?
That's not how it works. The connection is HTTPS, so the secret key is specific to the browser session, so it's not the same as matching "up the flashes around your curtain upon scene changes". The flashes will be specific to the viewer.
Silverlight/DASH/VBR produces specific sequences of video segment sizes, which can be extracted from the headers. Apparently.
And, more interestingly, someone is still using Silverlight.
This is Andrew, one of the paper's authors. I just want to clarify that our technique does not rely upon Silverlight - the issue lies with the combination of DASH (as a means to deliver video) and VBR (as a means to encode video).
Regarding HTTPS: If you and I watch the same video at the same bitrate, Netflix will send us the same amount of data per unit time. That is what we rely on with our technique. We do not bother looking at the encrypted application-layer data.
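Andrew's point can be illustrated with a toy sketch. This is our own simplification with made-up titles and sizes, not the paper's actual matching algorithm: an eavesdropper records the size of each encrypted video-segment transfer and looks for a title whose known segment-size sequence contains that run, within a small tolerance for TLS and header overhead.

```python
# Illustrative sketch (NOT the paper's algorithm): VBR makes each 4-second
# segment's size content-dependent, so the sequence of observed transfer
# sizes acts like a signature even though the payload is encrypted.

def match_window(observed, fingerprints, tolerance=0.02):
    """Return (title, start_offset) pairs whose fingerprint contains the
    observed size sequence.

    observed     -- list of segment sizes (bytes) seen on the wire
    fingerprints -- dict: title -> full list of known segment sizes
    tolerance    -- allowed relative deviation per segment
    """
    matches = []
    for title, sizes in fingerprints.items():
        for start in range(len(sizes) - len(observed) + 1):
            window = sizes[start:start + len(observed)]
            if all(abs(o - w) <= tolerance * w
                   for o, w in zip(observed, window)):
                matches.append((title, start))
                break  # one hit per title is enough for this sketch
    return matches

# Hypothetical fingerprint database and a sniffed 4-segment window:
fingerprints = {
    "Title A": [310_000, 455_000, 120_000, 390_000, 510_000, 280_000],
    "Title B": [300_000, 300_000, 300_000, 300_000, 300_000, 300_000],
}
observed = [456_000, 121_000, 391_000, 512_000]
print(match_window(observed, fingerprints))  # → [('Title A', 1)]
```

Note how the constant-bitrate "Title B" never matches: it is exactly the VBR variability that creates the distinguishable signature.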
> "Kranch offers a couple of ideas to fix the issue. For example, he says, “the browser could average the size of several consecutive segments and send HTTP GETs for this average size. As an alternative approach, the browser could randomly combine consecutive segments and send HTTP GETs for the combined video data.”"
Or even better, make GET requests that match up with the profile of a completely different video.
Just fill up the window size so it's always the MTU - nothing wrong with stuffing parts of the next few frames into the previous packet, and at the end, just shove in some random data. At the very least, it'd cut down on buffering and wouldn't really use all that much more bandwidth, since networking devices already expect a 1520-byte packet and use buffers assuming that size (and usually shove packets into the buffer spaced 1520 bytes apart).
This attack relies on the variability of the window size, so if everything is at the maximum, there is nothing to analyze. Obviously it would need a way of figuring out what that maximum size is (e.g. detecting whether some piece of equipment in between lowers the MTU below the expected value and causes fragmentation).
This is Andrew, one of the paper's authors. This solution would not defeat our technique (as described), since our technique is not concerned with individual packets. Instead, we use a program called adudump to infer the size of each transmitted video segment (each video segment is 4 seconds long and is individually requested by an HTTP GET). In fact, since these video segments are requested via HTTP GET, *the vast majority* of packets are already filled-to-the-brim with 1460 bytes of application-layer data. Only the last packet in the HTTP response will have less than 1460 bytes. Padding this last packet would just be noise to our algorithm.
That being said, this technique could be employed at a higher level (which is a suggestion that we make in the paper). Instead of requesting fixed intervals (always 4 seconds of video with each HTTP GET), the Netflix video player could alter its requests to instead aggregate several consecutive segments, or randomize their order so that they are not requested sequentially.
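The aggregation idea above can be sketched in a few lines. This is a hypothetical client-side mitigation (our own toy, not anything Netflix implements): instead of one 4-second segment per HTTP GET, the player requests a random number of consecutive segments per GET, so the per-transfer sizes an eavesdropper observes no longer map one-to-one onto the VBR segment-size fingerprint.

```python
import random

def aggregated_requests(segment_sizes, min_group=2, max_group=4, seed=None):
    """Group consecutive segments into variable-sized GETs.

    Returns the transfer sizes an on-path observer would now see: sums of
    randomly chosen runs of segments rather than individual segment sizes.
    """
    rng = random.Random(seed)  # seeded only for reproducibility in this demo
    transfers, i = [], 0
    while i < len(segment_sizes):
        n = rng.randint(min_group, max_group)
        transfers.append(sum(segment_sizes[i:i + n]))
        i += n
    return transfers

# Hypothetical per-segment sizes for one video:
segments = [310_000, 455_000, 120_000, 390_000, 510_000, 280_000]
print(aggregated_requests(segments, seed=1))
```

The total bytes transferred are unchanged; only the boundaries visible on the wire move, which is why this blurs (rather than eliminates) the signal - the paper's suggestion of also randomizing request order attacks the sequence property directly.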
@ Networc
Ah, that makes sense. In that case, I assume they are just grabbing an I-Frame + associated P-Frames, waiting for confirmation of reception, then sending the next I-Frame + its P-Frames. I thought they'd be sending based on portions of the video file as stored on the filesystem versus portions of video as stored in its container. Makes sense architecturally since the client would track state rather than being dependent on the server to do so.
Perhaps the solution may be to re-encode the videos with a format that bases the placement of I-Frames on the total number of bytes changed since the last I-Frame, rather than the number of P-Frames since the last I-Frame. Although that would mess with video seeking - although if nothing much is really changing, wouldn't you want to skip to the beginning or end of that scene directly? Like if the scene is a newscaster sitting still and addressing the audience, so really only the pixels making up their mouth change from one frame to the next, and you would either want to see it in its entirety or skip it in its entirety.
So, I will be the first to admit that video encoding starts to reach the limits of my expertise. That being said, when a video is encoded for DASH, it is encoded at multiple bitrates (quality levels) and the encoder is set to ensure that each time chunk (4 sec for Netflix) of each bitrate is playable in isolation, i.e. no Group-of-Pictures (GOP) spans consecutive time slots.
That allows the client to jump to any time slot at any bitrate and not require the neighboring time slots. This is also what enables the transition between quality levels in response to bandwidth conditions.
It's as if you're watching a playlist of many 4 second videos, each of which was retrieved via an HTTP GET.
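That "playlist of 4-second videos" model is easy to sketch. The URL layout and bitrate ladder below are invented for illustration (Netflix's real ones differ): the player picks the highest sustainable bitrate per time slot and issues one GET per slot, and because no GOP spans a slot boundary it can switch quality freely between slots.

```python
SEGMENT_SECONDS = 4
BITRATES_KBPS = [235, 560, 1050, 1750, 3000]  # example quality ladder

def pick_bitrate(bandwidth_kbps):
    """Choose the highest bitrate the measured bandwidth can sustain."""
    usable = [b for b in BITRATES_KBPS if b <= bandwidth_kbps]
    return usable[-1] if usable else BITRATES_KBPS[0]

def segment_urls(video_id, bandwidth_samples):
    """One hypothetical segment URL per 4-second slot; quality can change
    every slot because each segment is independently decodable."""
    return [
        f"/video/{video_id}/{pick_bitrate(bw)}kbps/seg{i:05d}.mp4"
        for i, bw in enumerate(bandwidth_samples)
    ]

# Bandwidth dips mid-stream, so the requested quality drops and recovers:
for url in segment_urls("example", [3200, 900, 400, 2000]):
    print(url)
```

Each response's size is the size of that segment at that bitrate - which is precisely the quantity the fingerprinting technique reads off the wire.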
This test isn't entirely accurate because of the small sample size. 100 titles generated 184 million data points, and under 4 minutes of watching one of those titles can determine which of the 100 titles was watched.
Netflix have quite a bit more than 100 titles, which means a massive increase in the number of data points to consider. Let's be generous, and say their algorithm has reasonable time complexity and can be completely parallelised. What cannot be done is to reduce the number of potential matches. With trillions of data points and millions of potential movies, the time required to give a definitive match will increase rapidly.
This is Andrew, one of the paper's authors. We first crawled Netflix to amass the full fingerprints for over 42,000 video titles (movies and shows). From these full fingerprints, we create a database that consists of every possible 2 minute window of each video. This constitutes the 184 million "data points".
Prior to testing our approach on network traffic, we first checked every possible window (the 184 million data points) to see which windows "look like" other windows to our algorithm. Obviously it would be a waste of our time to try our technique on network traffic if video fingerprints look identical. There are only a very small percentage of videos that "look alike", and many of those are actually comprised of identical footage.
So, with the confidence that our algorithm will not mistakenly identify a given 2 minute window of video, we tested it against actual network traffic (2 devices watching 100 video titles simultaneously).
It's not possible to avoid tracking anywhere, either online or in meatspace. Your phone can be tracked passively everywhere you go by the carrier, IMSI catchers, MAC address, Bluetooth, etc. -- not to mention the built-in tracking "features" of the phone OS.
So you leave your phone at home when you go to the store. Your car is tracked en route by license plate scanners and security cams. You don't use the store loyalty card (because privacy), but you're tracked by your credit or debit card number. Paying cash? Well, every checkout lane has a security cam pointed at it that's time-synced with the register system. So boom, they got you with facial recognition -- and if you wear a mask or hat, then gait analysis.
So you stay home, and you avoid Amazon, Facebook, et al (because privacy). Meanwhile, the contents of your emails, calendar, cloud docs, etc. are scanned in order to serve you "relevant ads." Your online photos are scanned for child porn. Your keystrokes, mouse movements, app usage, web history, etc. are monitored by Windows 10 (with the default settings) and sent back to the MotherShip, in order to improve...something.
So you turn off the computer and phone, draw the shades, and turn on the TV. Netflix themselves are already tracking you, as is the device that you use to stream it. Plus, the smart TV is sending all kinds of info back to...someone, and streaming everything you say to a cloud server to support the voice recognition "feature."
Hm, that's no good, so you unplug the router and switch to the antenna input. Your smart electric meter can be used to identify the over-the-air TV program you're watching by measuring the variance in electric usage by your TV, using a similar fingerprinting process to the one in the article.
Given all of that, this traffic analysis "leak" seems pretty tame.
I can imagine Nielsen, and others, will be dashing off to try and implement this to get viewing figures for their customers that are currently unavailable to them.
US ISPs, with their new freedom to sell off aggregate customer data, will be ideally placed to provide the network access.