TCP/IP headers leak info about what you're watching on Netflix

An infosec educator from the United States Military Academy at West Point has taken a look at Netflix's HTTPS implementation, and reckons all he needs to know what programs you like is a bit of passive traffic capture. The problem, writes Michael Kranch (with collaborator Andrew Reed), is information in TCP/IP headers are …

  1. Brian Miller

    Silence on the Wire

    A good book I read a while back was Silence on the Wire, about all of the data you could glean from a network just by listening. If someone wants to analyze your traffic, there's actually a lot that inadvertently leaks out.

    1. Anonymous Coward

      Re: Silence on the Wire

      When all else fails there's always traffic flow analysis. The only way to thoroughly defeat it is to be completely time-neutral. And before you know it, you end up needing a hard real time operating system.

      How many of those run on desktops or mobiles?

      1. DropBear

        Re: Silence on the Wire

        "How many of those run on desktops or mobiles?"

        Weird, I was under the distinct impression there was a thoroughly unremarkable (and very old) desktop in the next room spinning sizeable CNC motors up and down smoothly one CPU-generated step pulse at a time...

  2. Pen-y-gors

    So.....

    My excuse for having a library of 100,000 grumble-flicks downloaded from Pr0nhub is that "I was fingerprinting them to analyse network traffic"

    I suppose it's worth a try...

    1. Anonymous Coward

      Re: So.....

      Hey, we can't all use that excuse, it'd look mighty suspicious!

      Facebook chat between wives/partners/etc:

      First: "He's doing network traffic research again"

      Second: "Yours too?"

      1. Alan W. Rateliff, II
        Happy

        Re: So.....

        Crowd-sourced research.

      2. Robert Carnegie Silver badge

        Re: So.....

        http://dilbert.com/strip/1995-07-26 of course.

        "stress-test our network by downloading from the busiest servers on the internet"

        (not counting Dilbert.com and The Register)

    2. MyffyW Silver badge

      Re: So.....

      Personally I take a tolerant attitude to this because I think a little bit of casual network traffic analysis does no harm whatsoever.

      However please don't make me watch "Fifty Shades" again - not because of the nudity, just because it's shit.

    3. JLV

      Re: So.....

      That's commitment there.

      Did you manage to automate, or do you have to do it manually?

  3. WibbleMe

    The name's Archer, Sterling Archer

  4. P. Lee

    and all available to your US ISP for collection and sale

    eom

  5. Version 1.0 Silver badge
    Facepalm

    Of course this only affects Netflix?

    It's been my opinion for a long time that relying on HTTPS alone for real security is like boiling water in a paper bag - yes, it can be done but most of the time it's a fail.

    I'd be very surprised if this attack is limited to Netflix, and in any case traffic analysis can tell you a lot even if you don't break or work around SSL, which I suspect is not a barrier to nation-state access. Tin foil hat? Don't bother, they don't work either.

  6. Simon Harris

    3 minutes 55 seconds

    Netflix keeps providing 'Because you watched...' suggestions based on movies I didn't even get that far into before deciding they were pants.

    1. John Brown (no body) Silver badge

      Re: 3 minutes 55 seconds

      And half can be identified in under 2 minutes. It may even tell them something about you if you watch only the first minute or so of a certain number of unidentified films or shows, or maybe the ratio between partially and fully watched ones (and they know which ones you fully watched, so they might infer what you partially watched), or maybe the number you started and then rejected. It's all data that may be useful in identifying personality traits or other factors, especially over the longer term.

      Of course, much of that is only useful if they can identify the user, but that can probably be inferred long term even on a set top box by the patterns of which shows are watched and when. eg my wife loves crime and court dramas, I love SF, but we also have overlap in both genres that we watch together (as well as other genres of course). That data could, with a decent degree of accuracy, determine whether either my wife, me or both of us are watching TV at any chosen point, even though we generally don't "sign in" uniquely or use profiles.

    2. Anonymous Coward

      Re: 3 minutes 55 seconds

      My complaint is that a programme will show in my 'continue watching' list if I accidentally play an episode of it, usually when one series ends so it tries to play something completely different without me noticing.

  7. Anonymous Coward

    Alright lads we're half way there...

    Now all we need to figure out is how to make streams from Pornhub match the Netflix fingerprints. Stealthy porn sessions are almost here.

  8. NonSSL-Login

    Stating the obvious

    It goes without saying that if you can monitor the traffic flow and have something to compare it against, you can match stuff up. It's just that they spent the time to do it a specific way. It's not really rocket science. Just proving something we already knew but never spent the effort to test.

    It's not dissimilar to the BBC's ultra high tech method of knowing you are watching EastEnders in winter: they watch it on a little screen and match up the flashes around your curtain upon scene changes. Watching data packet types based on changing data rates for light/dark/action scenes (or with VBR music, bass/treble/melody/timing) is just a digital way of doing the same.

    If some services change the bitrate depending on congestion or saturation of an individual line, it would be interesting to see if this technique still worked. Also, does running two Netflix streams behind a firewall or NAT make the detection harder?

    1. Displacement Activity

      Re: Stating the obvious

      That's not how it works. The connection is HTTPS, so the secret key is specific to the browser session, so it's not the same as matching "up the flashes around your curtain upon scene changes". The flashes will be specific to the viewer.

      Silverlight/DASH/VBR produces specific sequences of video segment sizes, which can be extracted from the headers. Apparently.

      And, more interestingly, someone is still using Silverlight.

      1. Networc

        Re: Stating the obvious

        This is Andrew, one of the paper's authors. I just want to clarify that our technique does not rely upon Silverlight - the issue lies with the combination of DASH (as a means to deliver video) and VBR (as a means to encode video).

        Regarding HTTPS: If you and I watch the same video at the same bitrate, Netflix will send us the same amount of data per unit time. That is what we rely on with our technique. We do not bother looking at the encrypted application-layer data.
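To illustrate the point about data per unit time, here is a minimal Python sketch. It assumes a hypothetical capture format of `(timestamp_seconds, payload_bytes)` tuples and a 4-second bucket; neither the function name nor the sample numbers come from the paper. It only shows that two viewers of the same stream produce the same byte-count profile without anyone reading the encrypted payload.

```python
from collections import defaultdict

def bytes_per_interval(packets, interval=4.0):
    """Sum payload bytes into fixed-length time buckets.

    packets: iterable of (timestamp_seconds, payload_bytes) tuples
             (a hypothetical capture format, used for illustration only).
    Returns a list of per-bucket totals, counted from the first packet.
    """
    packets = sorted(packets)
    if not packets:
        return []
    start = packets[0][0]
    totals = defaultdict(int)
    for ts, size in packets:
        totals[int((ts - start) // interval)] += size
    return [totals[i] for i in range(max(totals) + 1)]

# Two made-up viewers of the same title at the same bitrate,
# starting playback at different wall-clock times.
viewer_a = [(0.0, 1460), (0.1, 1460), (0.2, 500), (4.0, 1460), (4.1, 900)]
viewer_b = [(t + 37.5, n) for t, n in viewer_a]

profile_a = bytes_per_interval(viewer_a)
profile_b = bytes_per_interval(viewer_b)
```

Both profiles come out identical even though the session keys, and therefore the ciphertext bytes, would differ between the two viewers.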

  9. Anonymous Coward

    Not even HTTPS can hide your secret Gilmore Girls fetish

    I'm completely open about my Gilmore Girls fetish!

    1. J.Smith

      My sentiments too, what's wrong with Gilmore Girls, and who'd want to hide watching it? My fetish for The Golden Girls on the other hand, I'm keeping that secret.

      1. Swarthy
        Coffee/keyboard

        I should not read El Reg while eating Lunch

        Well done you two!

    2. 2Nick3

      Re: Not even HTTPS can hide your secret Gilmore Girls fetish

      The AC is completely open??

  10. Alistair
    Pint

    oh darn

    Someone will figure out I'm binge watching Death in Paradise on weekend mornings.

  11. Tony Haines
    Happy

    Either

    >"Kranch offers a couple of ideas to fix the issue. For example, he says, “the browser could average the size of several consecutive segments and send HTTP GETs for this average size. As an alternative approach, the browser could randomly combine consecutive segments and send HTTP GETs for the combined video data.”"

    Or even better, make GET requests that match up with the profile of a completely different video.
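The first idea in the quoted passage (averaging consecutive segments) could look roughly like the Python sketch below. The group size of 3 and the function name are assumptions made for illustration; the article does not say how a real player would choose them or how it would map averaged sizes back onto byte ranges.

```python
def averaged_request_sizes(segment_sizes, group=3):
    """Replace each run of `group` segment sizes with their average.

    The total number of bytes is preserved; any remainder from the
    integer division is spread over the first requests of the run.
    """
    out = []
    for i in range(0, len(segment_sizes), group):
        chunk = segment_sizes[i:i + group]
        base, extra = divmod(sum(chunk), len(chunk))
        out.extend([base + 1] * extra + [base] * (len(chunk) - extra))
    return out

# Made-up segment sizes in bytes.
original = [100, 200, 301, 50]
smoothed = averaged_request_sizes(original)
```

An observer would then see runs of near-equal request sizes instead of the per-segment sizes that make up the fingerprint.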

  12. Anonymous Coward

    Monsters Inc.

    I'm about to watch that on DVD. Last night it was The Martian. The night before The Lady In The Van.

    Anonymous because I really don't want anyone figuring out my bank details from knowing that :)

  13. Crazy Operations Guy

    Easy to prevent

    Just fill up the window size so it's always the MTU. There's nothing wrong with stuffing parts of the next few frames into the previous packet, and at the end, just shove in some random data. At the very least, it'd cut down on buffering and wouldn't really use all that much more bandwidth, since networking devices already expect a 1520-byte packet and use buffers assuming that size (and usually shove packets into the buffer spaced 1520 bytes apart).

    This attack relies on the variability of the window size, so if everything is the maximum, there is nothing to analyze. Obviously it would need to find a way of figuring out what that maximum size is (e.g., detecting whether some piece of equipment in between has a lower limit than expected and is causing fragmentation).

    1. Networc

      Re: Easy to prevent

      This is Andrew, one of the paper's authors. This solution would not defeat our technique (as described), since our technique is not concerned with individual packets. Instead, we use a program called adudump to infer the size of each transmitted video segment (each video segment is 4 seconds long and is individually requested by an HTTP GET). In fact, since these video segments are requested via HTTP GET, *the vast majority* of packets are already filled-to-the-brim with 1460 bytes of application-layer data. Only the last packet in the HTTP response will have less than 1460 bytes. Padding this last packet would just be noise to our algorithm.

      That being said, this technique could be employed at a higher level (which is a suggestion that we make in the paper). Instead of requesting fixed intervals (always 4 seconds of video with each HTTP GET), the Netflix video player could alter its requests to instead aggregate several consecutive segments, or randomize their order so that they are not requested sequentially.
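As a rough illustration of the aggregation idea (a sketch under stated assumptions, not anything Netflix's player is known to do), the Python below groups consecutive segments into randomly sized runs so that each request covers an unpredictable amount of video. The `max_group` limit and the function name are hypothetical.

```python
import random

def aggregate_requests(segment_sizes, max_group=4, rng=None):
    """Combine consecutive segments into randomly sized groups.

    Returns the size in bytes of each combined request. An observer
    sees these sums rather than the individual segment sizes.
    """
    rng = rng or random.Random()
    combined = []
    i = 0
    while i < len(segment_sizes):
        group = rng.randint(1, max_group)  # how many segments this request covers
        combined.append(sum(segment_sizes[i:i + group]))
        i += group
    return combined

# Made-up segment sizes in bytes; a fixed seed only makes the demo repeatable.
sizes = [310_000, 420_500, 298_200, 515_900, 402_300, 377_700]
requests = aggregate_requests(sizes, rng=random.Random(0))
```

The total bytes transferred stay the same, but the boundaries between requests no longer line up with the fixed 4-second chunks.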

      1. Crazy Operations Guy

        Re: Easy to prevent

        @ Networc

        Ah, that makes sense. In that case, I assume they are just grabbing an I-Frame + associated P-Frames, waiting for confirmation of reception, then sending the next I-Frame + its P-Frames. I thought they'd be sending based on portions of the video file as stored on the filesystem versus portions of video as stored in its container. Makes sense architecturally since the client would track state rather than being dependent on the server to do so.

        Perhaps the solution may be to re-encode the videos with a format that bases the placement of I-Frames on the total number of bytes changed since the last I-Frame, rather than on the number of P-Frames since the last I-Frame. That would mess with video seeking, although if nothing much is really changing, wouldn't you want to skip to the beginning or end of that scene directly? For example, if the scene is a newscaster sitting still and addressing the audience, only the pixels making up their mouth would change from one frame to the next, and you would want to either see it in its entirety or skip it in its entirety.

        1. Networc

          Re: Easy to prevent

          So, I will be the first to admit that video encoding starts to reach the limits of my expertise. That being said, when a video is encoded for DASH, it is encoded at multiple bitrates (quality levels) and the encoder is set to ensure that each time chunk (4 sec for Netflix) of each bitrate is playable in isolation, i.e. no Group-of-Pictures (GOP) spans consecutive time slots.

          That allows the client to jump to any time slot at any bitrate and not require the neighboring time slots. This is also what enables the transition between quality levels in response to bandwidth conditions.

          It's as if you're watching a playlist of many 4 second videos, each of which was retrieved via an HTTP GET.
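Because each chunk is fetched with its own HTTP GET, the size of each response can be estimated from packet lengths alone. The Python sketch below is a simplified stand-in for that idea and is not how adudump works internally: it assumes a hypothetical list of `(direction, payload_bytes)` events on one connection, with only one request outstanding at a time.

```python
def segment_sizes(events):
    """Estimate response sizes from a sequence of packet events.

    events: list of (direction, payload_bytes), where direction is
            "up" (client to server) or "down" (server to client).
            This format is an assumption made for illustration.
    Every upstream packet with payload is treated as (part of) a new
    request; downstream bytes are summed until the next request.
    """
    sizes = []
    current = None
    for direction, payload in events:
        if payload == 0:
            continue  # pure ACKs carry no application data
        if direction == "up":
            if current:  # close the previous response, if any bytes arrived
                sizes.append(current)
            current = 0
        elif direction == "down" and current is not None:
            current += payload
    if current:
        sizes.append(current)
    return sizes

# Made-up trace: two requests, each answered by full packets plus a short last one.
trace = [
    ("up", 400), ("down", 1460), ("up", 0), ("down", 1460), ("down", 300),
    ("up", 410), ("down", 1460), ("down", 120),
]
estimated = segment_sizes(trace)
```

Padding only the final short packet of each response would shift these totals by at most one packet's worth, which matches the earlier remark that such padding is just noise to the algorithm.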

  14. Tom 38
    Headmaster

    Not quite, maths

    This test isn't entirely accurate because of the small sample size. 100 titles generated 184 million data points, and under 4 minutes of watching one of those titles can determine which of the 100 titles was watched.

    Netflix have quite a bit more than 100 titles, which means a massive increase in the number of data points to consider. Let's be generous, and say their algorithm has reasonable time complexity and can be completely parallelised. What cannot be done is reduce the number of potential matches. With trillions of data points and millions of potential movies, the time that is required to give a definitive match will increase rapidly.

    1. Networc

      Re: Not quite, maths

      This is Andrew, one of the paper's authors. We first crawled Netflix to amass the full fingerprints for over 42,000 video titles (movies and shows). From these full fingerprints, we create a database that consists of every possible 2 minute window of each video. This constitutes the 184 million "data points".

      Prior to testing our approach on network traffic, we first checked every possible window (the 184 million data points) to see which windows "look like" other windows to our algorithm. Obviously it would be a waste of our time to try our technique on network traffic if video fingerprints look identical. Only a very small percentage of videos "look alike", and many of those actually consist of identical footage.

      So, with the confidence that our algorithm will not mistakenly identify a given 2 minute window of video, we tested it against actual network traffic (2 devices watching 100 video titles simultaneously).
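The windowing described above can be sketched in a few lines of Python. With 4-second segments, a 2-minute window spans 120 / 4 = 30 segments. The brute-force lookup and the 1% tolerance below are assumptions chosen to keep the example short; they are not the search structure or threshold used in the paper.

```python
SEGMENT_SECONDS = 4
WINDOW_SECONDS = 120
WINDOW_LEN = WINDOW_SECONDS // SEGMENT_SECONDS  # 30 segments per 2-minute window

def windows(fingerprint, n=WINDOW_LEN):
    """Every run of n consecutive segment sizes in one title's fingerprint."""
    return [tuple(fingerprint[i:i + n]) for i in range(len(fingerprint) - n + 1)]

def build_database(fingerprints, n=WINDOW_LEN):
    """Flatten {title: [segment sizes]} into (title, offset, window) rows."""
    return [
        (title, offset, window)
        for title, fp in fingerprints.items()
        for offset, window in enumerate(windows(fp, n))
    ]

def identify(observed, database, tolerance=0.01):
    """Return (title, offset) for every window within tolerance of observed."""
    hits = []
    for title, offset, window in database:
        if len(window) == len(observed) and all(
            abs(o - r) <= tolerance * r for o, r in zip(observed, window)
        ):
            hits.append((title, offset))
    return hits

# Tiny made-up fingerprints, using 3-segment windows to keep the demo readable.
fingerprints = {"A": [100, 200, 300, 400], "B": [150, 150, 150, 500]}
db = build_database(fingerprints, n=3)
match = identify([201, 299, 402], db)
```

A linear scan like this would be slow over 184 million windows, so a real system would presumably need some indexed nearest-neighbour search; the sketch only shows what is being matched.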

  15. Lord_Beavis
    Pirate

    TOS Violation?

    "...harvested by setting up a server to automatically “watch” videos..."

    What's to stop someone from using a server like that to rip the videos?

    <voices from off keyboard>

    Oh? What, they have?

    I withdraw my question.

  16. fidodogbreath
    Big Brother

    So what?

    It's not possible to avoid tracking anywhere, either online or in meatspace. Your phone can be tracked passively everywhere you go by the carrier, IMSI catchers, MAC address, Bluetooth, etc. -- not to mention the built-in tracking "features" of the phone OS.

    SO you leave your phone at home when you go to the store. Your car is tracked en route by license plate scanners and security cams. You don't use the store loyalty card (because privacy), but you're tracked by your credit or debit card number. Paying cash? Well, every checkout lane has a security cam pointed at it that's time-synced with the register system. So boom, they got you with facial recognition -- and if you wear a mask or hat, then gait analysis.

    SO you stay home, and you avoid Amazon, Facebook, et al (because privacy). Meanwhile, the contents of your emails, calendar, cloud docs, etc. are scanned in order to serve you "relevant ads." Your online photos are scanned for child porn. Your keystrokes, mouse movements, app usage, web history, etc. are monitored by Windows 10 (with the default settings) and sent back to the MotherShip, in order to improve...something.

    SO you turn off the computer and phone, draw the shades, and turn on the TV. Netflix themselves are already tracking you, as is the device that you use to stream it. Plus, the smart TV is sending all kinds of info back to...someone, and streaming everything you say to a cloud server to support the voice recognition "feature."

    Hm, that's no good, so you unplug the router and switch to the antenna input. Your smart electric meter can be used to identify the over-the-air TV program you're watching by measuring the variance in electric usage by your TV, using a similar fingerprinting process to the one in the article.

    Given all of that, this traffic analysis "leak" seems pretty tame.

  17. James R Grinter

    Viewing figures?

    I can imagine Nielsen, and others, will be dashing off to try and implement this to get viewing figures for their customers that are currently unavailable to them.

    US ISPs, with their new freedom to sell off aggregate customer data, will be ideally placed to provide the network access.

  18. Anonymous Coward

    ye but

    if i ever found anything on Netflix actually worth watching in the first place,

    i'm pretty sure i'd (want to) keep that quiet too..
