FYI: AI tools can unmask anonymous coders from their binary executables

Talk about the ultimate Git Blame. Programmers can be potentially identified from the low-level machine-code instructions in their software executables by AI-powered tools. That's according to boffins from Princeton University, Shiftleft, Drexel University, Sophos, and Braunschweig University of Technology, who have described …

  1. Anonymous Coward
    Anonymous Coward

    You got me, I'm the hacker that still uses goto and I never use OOP because procedural programming is much more 733t.

    I would love to believe this is possible but an accuracy rate of 65 per cent is neither here nor there. It does work well when identifying state sponsors though.

    1. John Smith 19 Gold badge
      Big Brother

      "It does work well when identifying state sponsors though."

      You have it backwards.

      It's very good for states to find anyone writing code they don't like.

      For those sorts of state a 35% failure rate is acceptable. *

      *"Better a 100 innocent men are punished than a single guilty man escape" as a well known psychopath once put it.

  2. JohnFen

    I'm surprised

    I'm surprised that it took this long, really. Anyone who's worked long enough with specific developers learns how easy it is to tell what code they've written based purely on their style. It's as unique as a fingerprint. It always reminded me a bit of the fact that in the old days of the telegraph, telegraph operators could identify each other based on their particular keying rhythms.

    1. Roger Varley

      Re: I'm surprised

      I would agree with you if we were talking about source code. But after passing through a compiler?

      1. Charles 9

        Re: I'm surprised

        "I would agree with you if we were talking about source code. But after passing through a compiler?"

        The compiler is still basically directed by the source code, so the end result is still going to preserve the essential coding style of the original writer. Code optimizations and code munging can change things some, but it's more like distorting a person's signature; the essential style characteristics embedded into the original code will still be there if you look carefully enough.

      2. JohnFen

        Re: I'm surprised

        Yes, even after passing it through a compiler -- although you need a larger code sample to be able to tell with accuracy, because you're relying more on macro patterns than micro patterns.

    2. Charles 9

      Re: I'm surprised

      And how handwriting analyzers can determine likelihood of a particular person writing through "grown" characteristics of the writer (style characteristics basically developed as a person acquired the skill to write).

    3. bombastic bob Silver badge
      Meh

      sample set is too small

      Seriously, the sample set is too small. If they'd used THOUSANDS of coders [or better still, MILLIONS] and been able to get a 65% accuracy on determining "who wrote this", I'd be impressed.

      And in the case of finding out who wrote an "illegal" program, this is what you'd have to be able to do.

      No fear necessary.

      1. harmjschoonhoven

        Re: sample set is too small

        That you are paranoid does not mean they are not out to get you.

        On the other hand, after three days without coding, life becomes meaningless.

        1. JohnFen

          Re: sample set is too small

          If they really are out to get you, then you aren't paranoid -- you're correct.

    4. The Man Who Fell To Earth Silver badge
      Boffin

      Questions questions...

      1. How well does it work if the programming language is uncommon and can't be determined from the binary?

      2. How well does it work if #1 is true & the programming language is Assembly?

      3. How well does it work if #1 is true & the high level language compiler allows inline Assembly so the programmer can randomly jump between the language & Assembly?

      1. Anonymous Coward
        Anonymous Coward

        Re: Questions questions...

        #4 and how dependent is the technique on the unique usage of libraries and other tools used by the programmer (versus the "programming style")

    5. Little Mouse

      Re: I'm surprised

      "I'm surprised that it took this long"

      Quite. IIRC something very similar was achieved in the world of regular books a few years back - reducing a given author's style to a digital fingerprint. A useful tool for proving the provenance of disputed authorship.

      I see no real difference here.

  3. StargateSg7

    My code is VERY EASY to identify...you can actually READ IT and UNDERSTAND IT!

    That is what happens when your first programming languages are PASCAL and COBOL...you have no other choice BUT to write easy-to-read and therefore very-identifiable code! I would be easily and QUICKLY found out by any investigator.

    I know one C programmer who would NEVER be able to be identified by this method because NO-ONE except himself can read his code and the only reason he is employed at all, is that his code is the FASTEST CODE AROUND PERIOD for embedded processors and specialty applications! He hasn't had to work on a financial basis since the early 1990's but because all the "Big Boys" of industrial and consumer hardware want him for his superior speed-up expertise, he keeps amassing a very large fortune by writing the world's most UNREADABLE C code!

    He knows every CPU optimization of every part of the various C compilers he uses down at the assembler code and register-usage level and there is NO WAY his type of code could be profiled at the assembler/binary level using "Stylistic Differentiation"...the optimizations he creates cause the compilers to output only the most basic and reduced instructions.

    1. Charles 9

      Have you ever thought that what you describe in itself is a coding style? Meaning he CAN be identified?

      1. StargateSg7

        TRUE! But I think that if he wanted to, he could just mask his optimizations and make his code look like generic output. Anyone who knows how to modify operating system and hardware driver assembler code WHILE IT'S RUNNING can mask his code-style traces to any level he so desires.

        1. Charles 9

          But that would be like altering one's handwriting to mimic another: unnatural to the practitioner due to force of habit. Plus I don't think there IS such a thing as "generic" code since most code is made by man, which means each snippet will have a style signature.

    2. Gene Cash Silver badge

      UN-altered REPRODUCTION and DISSEMINATION of this IMPORTANT Information is ENCOURAGED, ESPECIALLY to COMPUTER BULLETIN BOARDS.

      1. Alan Brown Silver badge

        "UN-altered REPRODUCTION...."

        Uh oh, nobody talk about turkeys.

    3. JohnFen

      "I know one C programmer who would NEVER be able to be identified by this method because NO-ONE except himself can read his code and the only reason he is employed at all, is that his code is the FASTEST CODE AROUND PERIOD"

      That sounds like it would be exceptionally easy to identify.

    4. Anonymous Coward
      Anonymous Coward

      "My code is VERY EASY to identify..."

      My style analysis tells me he may be the same guy writing here, which led to more than a good laugh:

      http://www.canonrumors.com/forum/index.php?topic=33975.0

      We are waiting for his magnificent code to appear, so we can perform a style analysis on it.

      1. StargateSg7

        Re: "My code is VERY EASY to identify..."

        Ya Got Me! --- I'm ONE AND THE SAME!!!! and YES my CODEC will be released very soon now. I do have a day job and my employer needs my expertise in video production and coding (everyone does everything at this company - i.e. Multi-tasking!) so I can only work on it in my off-hours.

        Here is the basic outline of the code which is MOST READABLE AND UNDERSTANDABLE

        quite unlike my colleague from years ago:

        Threadsafe_Global_Variables:
          Final_Video_Output_Filename : Character_String_Type;
          Frame_Buffer_Images,
          Processed_Output_Images : Array[ ONE..Maximum_Frame_Group_Length ] of Bitmap_Image_Type;

        Threadsafe_Global_Constants:
          Maximum_Frame_Group_Length = 120;

        Program_Begin
          Show_Destination_Output_File_Dialog( Final_Video_Output_Filename );
          Call_High_Resolution_Interrupt_Timer( ONE_HUNDRED_TWENTY_TIMES_PER_SECOND );
        End_Program;

        Define_CODEC_Procedures_and_Functions:

        Procedure Interrupt_Timer_Event_Handler( Number_Of_Frames_In_Group: Signed_Integer_Type );
        Var
          x, y,
          Frame_Number : Signed_Integer_Type;
        Begin
          Try
            Keep_Within_Limits( Number_Of_Frames_In_Group, ONE, Maximum_Frame_Group_Length );
            for Frame_Number := ONE to Number_Of_Frames_In_Group do
            Begin
              Ingest_Current_Frame_From_Camera_Buffer( Frame_Buffer_Images[ Frame_Number ] );
              for y := ONE to Height_Of_Image do
                for x := ONE to Width_Of_Image do
                Begin
                  Process_Current_And_Neighbouring_Pixels( Frame_Buffer_Images[ Frame_Number ],
                                                           x, y, Frame_Number,
                                                           Processed_Output_Images[ Frame_Number ] );
                End;
              if Frame_Number = Number_Of_Frames_In_Group then
                Save_Group_of_Frames( Final_Video_Output_Filename, Number_Of_Frames_In_Group );
            End;
            Stop_And_Exit_Compression_Program_Whenever_Main_Window_Is_Closed;
          Except
            Handle_Overflow_UnderFlow_NAN_Exceptions;
          End;
        End;

        Procedure Save_Group_of_Frames( Output_Filename: Character_String_Type;
                                        Number_Of_Frames_In_Group: Signed_Integer_Type );
        Var
          Frame_Number : Signed_Integer_Type;
          Destination_Video_File : File of Compressed_Video_Frame_Type;
        Begin
          Try
            Open_File( Destination_Video_File, Output_Filename, APPEND_TO_END_OF_FILE );
            for Frame_Number := ONE to Number_Of_Frames_In_Group do
              Save_Compressed_Video_Frame_To_File( Processed_Output_Images[ Frame_Number ] );
            Close_File( Destination_Video_File );
          Except
            Handle_File_Exceptions_Here;
          End;
        End;

        Soooooooooo........Can you read and understand this????? If you can then I did my job!

    5. Nolveys

      My code is VERY EASY to identify...you can actually READ IT and UNDERSTAND IT!

      I can THINK of ANOTHER reason that SOMEONE could POSSIBLY IDENTIFY your CODE.

    6. Adam 1

      > the only reason he is employed at all, is that his code is the FASTEST CODE AROUND PERIOD for embedded processors and specialty applications!

      Maybe you could forward his CV to Intel. Heard they may be interested in someone who can work the fastest code around period.

  4. Adrian 4

    Optimisations

    We're often told that the tricks we learnt to get code to execute faster back in the days before good optimisation are worthless because a decent compiler will do that anyway.

    This research gives the lie to that idea. Code written with DIY optimisations is substantially different from code written primarily for clarity. If this technique isn't defeated by compiler optimisations, then the optimisations are pretty unimpressive.

    1. Charles 9

      Re: Optimisations

      Not necessarily. It could just be a "six of one, half a dozen of the other" thing: more than one way to get comparable results.

    2. Richard 12 Silver badge

      Re: Optimisations

      Some of the older tricks execute more slowly on modern CPUs, because the balance has changed.

      Eg: Lookup tables can now be slower than recalculating, because the table doesn't fit in a cache line but the calculation does.

      Taking advantage of SSE and AVX is often faster than loop unrolling.

      You can do any of these manually, but having done so, you probably won't revisit and change it when the balance changes and another optimisation technique becomes better.

  5. Frozit

    Now, RISC code after full optimization might be harder.... That stuff is strange.

  6. JeffyPoooh
    Pint

    Tables, nearly Code-Free State Machines, and future Requirements Compilation

    Once upon a time (early 1980s), there was a coding contest to see how much functionality could be crammed into one line of BASIC code; limited to about 240 characters. I arrived at a way to have a one line 'engine', and then as many subsequent DATA statements as you wish. With the extra DATA lines, it wasn't really a 'One Liner' winner, oh well.

    Each DATA statement was conceptually a row in a table, and each row effectively encoded a machine 'state'. The data elements were: State ID#, assigned action or output data, then an extensible list of condition values with their next state ID#. The program inputs caused the engine to jump around the table based on those inputs, as designed and listed in the table.

    Essentially all the states of the machine would be coded into a big dumb table, and the actual code was simply a very tight little loop.
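    A minimal sketch of the table-plus-engine idea described above, in Python rather than BASIC. The row layout (state ID, output, then a list of condition/next-state pairs) follows the description; the turnstile states, the names `TABLE` and `run`, and the rule that an unlisted input leaves the state unchanged are assumptions made purely for illustration.

```python
# Each row plays the role of one DATA statement:
# (state ID, output for that state, [(input condition, next state ID), ...])
# The turnstile example is hypothetical, chosen only to keep the table small.
TABLE = [
    (0, "locked",   [("coin", 1), ("push", 0)]),
    (1, "unlocked", [("coin", 1), ("push", 0)]),
]


def run(table, inputs, start=0):
    """The 'engine': a tight loop that only looks things up in the table."""
    rows = {row[0]: row for row in table}
    state = start
    outputs = [rows[state][1]]
    for symbol in inputs:
        transitions = dict(rows[state][2])
        # Assumption: an input with no listed transition keeps the current state.
        state = transitions.get(symbol, state)
        outputs.append(rows[state][1])
    return outputs


if __name__ == "__main__":
    print(run(TABLE, ["push", "coin", "push"]))
```

    All of the behaviour lives in `TABLE`; changing the machine means editing rows, not the loop.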

    It's a powerful concept, in applicable circumstances. Put your machine states into a trivial table format, automatically transcribe it in, and then add the one line engine. Done.

    The same thing could be done in assembler. A wee tiny bit of actual code, and then a huge table making it sing. The big silly table could be prepared in MS-Excel, even by a manager.

    It's a small step from the above concept to that (soon to be here) future of Requirements Compilation directly into code. Spec writers become coders.

    This sort of Table Driven State Machine coding method is very nearly code free. In case that helps.

    1. frobnicate

      Re: Tables, nearly Code-Free State Machines, and future Requirements Compilation

      You just described how professors work. "Tables" = "machine code".

      1. Anonymous Coward
        Anonymous Coward

        Re: Tables, nearly Code-Free State Machines, and future Requirements Compilation

        Did you intend "processors" rather than "professors"? If so, then yes, although I constrain, and validate like hell, to a limited set of instructions. Just me being me though. I think.

        1. Destroy All Monsters Silver badge

          Re: Tables, nearly Code-Free State Machines, and future Requirements Compilation

          These are NOT finite state machines.

          These are Turing Machines.

          Which are interpreted by a lower-level Turing machine. With Polynomial slowdown.

    2. Anonymous Coward
      Anonymous Coward

      Re: Tables, nearly Code-Free State Machines, and future Requirements Compilation

      I've always made extensive use of finite state machines as there was no way in hell I was going to let my code execute non-deterministically if at all possible. Breaking out or doing the unplanned was a ticket to, as I've said quite often, a federal prison should things blow-up or people are harmed or killed. So, you've pretty much described my style, no matter what the tools.

      As to stylometry, I've not a worry in the world. I've not got code out there at all accessible. Not that I want to go off the reservation. Quite the contrary. Still, reassuring.

      1. Anonymous Coward
        Anonymous Coward

        Re: Tables, nearly Code-Free State Machines, and future Requirements Compilation

        "I've always made extensive use of finite state machines [...]"

        I remember reading a paper on finite state machines in 1974 when producing a spec for a protocol driver. My design produced an "engine" that depended on one instruction in the machine's code set that proved very efficient for that purpose. It was too novel for the person who did the development - who coded it in more linear fashion. He did acknowledge it was the most complete spec he had ever used.

        Even without using FSM tables per se you can produce data driven code that is basically an "engine". Over the years I have used many of my designs for totally new purposes. Not the most efficient at run time - but quick to implement an enhancement or a new use.

  7. Anonymous Coward
    Anonymous Coward

    MOV R0, #1

    .loop

    ADD R0, R0, #1

    CMP R0, #100

    BLE loop

    Try de-anonymising that. You could probably analyze C, because everyone who writes it uses their own unique style. I doubt it's in any way practical with assembler, since there are really only two ways to write that: looped or unwound.

    1. Charles 9

      OK, exactly where does the snippet fit into the rest of the code, how does the code around it mesh with the loop, do you use CMP #100/BLE or CMP #101/BLT instead? Or perhaps start with MOV #100, DEC, and BNZ instead (to skip the CMP step)? Just saying there's more than one way to skin a processor.
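      The point about equivalent loops can be sketched with Python bytecode standing in for machine code (an analogy only, not ARM). The two hypothetical functions below, `count_up` and `count_down`, run the same number of iterations as the snippet above, yet compile to different instruction sequences.

```python
def count_up():
    """Mirror of MOV #1 / ADD / CMP #100 / BLE: count upwards and compare."""
    iterations = 0
    r0 = 1
    while True:
        r0 += 1
        iterations += 1
        if r0 > 100:
            break
    return iterations


def count_down():
    """Alternative: start at 100 and count down to zero, with no upper-bound compare."""
    iterations = 0
    r0 = 100
    while r0 != 0:
        r0 -= 1
        iterations += 1
    return iterations


# Same observable behaviour...
same_behaviour = count_up() == count_down()
# ...but the compiled instruction streams are not identical.
different_bytecode = count_up.__code__.co_code != count_down.__code__.co_code

if __name__ == "__main__":
    print(same_behaviour, different_bytecode)
```

      Which variant a programmer habitually reaches for is exactly the kind of choice a stylometric tool would try to pick up, given enough surrounding code.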

  8. Anonymous Coward
    Big Brother

    a rent on life, middle-manning it with code

    That's easy to avoid, just copy everyone else's code

    it all does the same thing anyway and you're just wanting to seem indispensable.

    a rent on life, middle-manning it with code

    1. Anonymous Coward
      Anonymous Coward

      Re: a rent on life, middle-manning it with code

      When I did support programming I always used to imitate the style and intentions of the original author so that the change was seamless.

      That took time to understand how the original code worked. Development colleagues often just grafted on a blister of code in their preferred style. Often fixing the symptom rather than the underlying problem.

      1. Alan Brown Silver badge

        Re: a rent on life, middle-manning it with code

        "Often fixing the symptom rather than the underlying problem."

        Apart from having just described Microsoft's model for the entire 1990s-2000s, in a lot of cases the underlying problem WAS the style and intent of the original author.

        On more than one occasion the correct fix was to replace the code entirely.

  9. Brian Miller

    Reproducibility

    Take a look at the source code of theirs on Github. Decompiling binaries back to C is so messed up it's not funny. Sure, they got something. However, this is something that bears examination, and I really question what they did. How much picking and choosing did they do for their data sets? Did they throw out code that didn't reliably decompile? Because I have some stuff I'd like to see how the Snowman decompiler does on it.

    Also, their "obfuscation" was a bit on the trivial side, using the llvm obfuscator.

    I would like to see more work on this, and see if this is reproducible with different compilers, different options, etc. They seem to have tried one thing at a time, and not combinations.

    1. Richard 12 Silver badge

      It doesn't matter

      What is important is to use the same decompiling toolchain with all the samples as well as the code-under-test.

      This is looking for common patterns. It doesn't really matter what the patterns look like, only that they exist.

      All you actually need is large samples of binaries of known provenance to compare against.

      The technique is new, the theory isn't. Malware has been traced back to specific groups (named or otherwise) many times.
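      A toy sketch of "compare against samples of known provenance", assuming opcode bigram counts as the features and cosine similarity as the comparison. This is not the researchers' pipeline; the function names, the author labels, and the tiny opcode lists are all made up for illustration.

```python
import math
from collections import Counter


def bigram_profile(opcodes):
    """Count adjacent opcode pairs; a crude stand-in for real stylometric features."""
    return Counter(zip(opcodes, opcodes[1:]))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors held in Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(unknown, known_samples):
    """Return the known author whose profile is closest to the unknown sample."""
    profile = bigram_profile(unknown)
    return max(
        known_samples,
        key=lambda author: cosine_similarity(profile, bigram_profile(known_samples[author])),
    )


# Hypothetical samples of known provenance, all run through the same toolchain.
KNOWN = {
    "alice": ["MOV", "ADD", "CMP", "BLE", "MOV", "ADD", "CMP", "BLE"],
    "bob": ["MOV", "SUBS", "BNE", "MOV", "SUBS", "BNE", "MOV", "SUBS", "BNE"],
}

if __name__ == "__main__":
    print(attribute(["MOV", "SUBS", "BNE", "MOV", "SUBS", "BNE"], KNOWN))
```

      As the comment says, what the patterns look like matters less than extracting them the same way from every sample; a nearest-match scheme like this always names somebody, which is where the false-attribution worries elsewhere in the thread come in.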

  10. TRT Silver badge
    Devil

    Thank goodness...

    I've only ever released StackOverflow copypasta.

    1. Paul Hovnanian Silver badge
      Pirate

      Re: Thank goodness...

      Came to copy and paste this same comment. Left satisfied.

  11. Anonymous Coward
    Anonymous Coward

    "Another is using a different identity for every bit of code released."

    Would that not reveal a common identity shared between the several identities? It would still provide a significant modus operandi that might then be correlated with another linking factor.

  12. Omgwtfbbqtime
    Trollface

    Hmm...

    So it can/might identify a coder from their repositories...

    So for anything you release to the public, make it all your own work.

    For anything you don't want tracing back to you use a good spread of copypasta.

    Or just use off the shelf stuff and let the blame fall elsewhere.

    1. Anonymous Coward
      Anonymous Coward

      And the takeaway is

      Pay someone else to write your malware.

  13. Christian Berger

    So how is this different to other kinds of stylometry...

    ... which you can easily get around by just writing code in another style. Having stylometry even allows you to modify your code gradually so it'll look like code from someone else.

  14. Anonymous Coward
    Anonymous Coward

    Variation Space, a bit like Address Space

    They'll have identified some characteristics, each with several possible values (style used). These together define the maximum possible "Address Space" of this identification scheme. E.g. 10 characteristics with 4 choices each is equivalent to 20 bits, or one in a million.

    But then they'd need to account for 'Bell curves' where the possible values are not evenly used. Then they'd also have to account for the correlation across characteristics. The effective Address Space will be a fairly small fraction of the theoretical space. Probably an order of magnitude, maybe two orders, effectively smaller. E.g. ballpark one in 30k.
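    The arithmetic above can be sketched as follows. The uniform case reproduces the 20-bit figure; the skewed distribution is an invented example of a "Bell curve" style preference, and treating the characteristics as independent is an assumption (correlation between them would shrink the space further).

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits of one characteristic's distribution of choices."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


N_CHARACTERISTICS = 10

# Four equally likely choices per characteristic: the theoretical maximum.
uniform = [0.25, 0.25, 0.25, 0.25]
# Hypothetical skew: most programmers make the same choice.
skewed = [0.70, 0.15, 0.10, 0.05]

# Assumption: characteristics are independent, so their bits simply add.
theoretical_bits = N_CHARACTERISTICS * entropy_bits(uniform)
effective_bits = N_CHARACTERISTICS * entropy_bits(skewed)

if __name__ == "__main__":
    print(f"theoretical: {theoretical_bits:.1f} bits, about 1 in {2 ** theoretical_bits:,.0f}")
    print(f"skewed:      {effective_bits:.1f} bits, about 1 in {2 ** effective_bits:,.0f}")
```

    With uniform choices this gives 20 bits, i.e. one in 1,048,576; any skew in how often each choice is used lowers the effective figure.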

    These are just the extremely basic Address Space considerations. What about: Noise, Deception, Unknown Libraries, Obfuscation, Misunderstood Processes, Copying, Sample Code, etc. ?

    Although not up to Evidence standards, it might have some value as an Investigative Tool, but positive or negative value? Positive value to the actual perpetrator who was able to frame someone else by copying their code and mimicking their style for the changes?

    There's a growing pile of discredited "forensic sciences" (sic). I suspect that a space should be reserved for this one.

    1. Anonymous Coward
      Anonymous Coward

      Re: Variation Space, a bit like Address Space

      The Address Space conceptual analysis described above is perfectly sound. Although somewhat trivial, it is a missed step far too often. The conclusions follow directly from it.

      Their only escape from this critique is if they have a large library of such coding-fingerprint characteristics, and have already accounted for the Bell curve and internal correlation limiters.

      Downvotes without explanation are a bit pointless.

  15. Pier Reviewer

    StackOverflow

    Some guy on stackoverflow.com is going to have a lot of code attributed to him...

  16. EveryTime

    There are so many caveats and limitations that this is almost a click-bait headline.

    Unstripped binaries of certain high level languages run through common compilers pull a whole bunch of source code information into the object files.

    Strip the binaries and that information goes away. What you are left with is general characteristics of the programmer. You'll be able to sort programmers into groups, and identify that a program likely came from a specific group, but it's really unlikely that you could match compiled code to a specific programmer.

    Now if you are pattern-matching against re-used code from github, that's a different story. With a big enough code sample, you have a pretty good chance of seeing where the programmer cribbed from. And probably mis-identify the code as having been written by a prolific programmer that is often copied from.

  17. Dwarf

    Teamwork, compilers and statistics

    So what happens when a team works on an app, or it moves from programmer to programmer as teams change and code still needs to be maintained over multiple years?

    Similarly, what happens as the compilers and linkers evolve - what does the supposed optimisation voodoo do to the code?

    Looks like pure statistics to me and as we all know, you can prove anything with statistics. Last I heard, 76.45% of statistics were made up on the spot.

    1. Charles 9

      Re: Teamwork, compilers and statistics

      But then, is that itself a statistic made up on the spot? Fake stats all the way down?

      1. Dwarf

        Re: Teamwork, compilers and statistics

        It would be ironic if you couldn't spot the irony in the previous statement.

        Of course it's made up on the spot, that was the whole point.
