Generative Adversarial Networks for Maldev

Atlan Team

Introduction:

Many readers may not be familiar with Machine Learning and the various types of ML models that can be developed. This post focuses on GANs (Generative Adversarial Networks): it presents a short introduction and some suggestions for an approach that we ourselves are following.

A GAN is defined as:

A generative adversarial network is a class of machine learning frameworks designed by Ian Goodfellow and his colleagues in June 2014. Two neural networks contest with each other in a game. Given a training set, this technique learns to generate new data with the same statistics as the training set.
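For reference, the "game" in this definition is usually written as the minimax objective from the original 2014 paper. Here $G$ is the generator, $D$ the discriminator, $p_{\text{data}}$ the distribution of the training set and $p_z$ the noise prior fed to the generator:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to maximise this value by scoring real data high and generated data low, while the generator tries to minimise it by producing samples the discriminator scores as real.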

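While this definition is useful, a graphic can sometimes say a thousand words. Essentially, you are pitting one Machine Learning model against another: a generator that produces candidate samples and a discriminator that tries to tell them apart from real data.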
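To make the two-player setup concrete, here is a minimal sketch in plain Python of a toy GAN on one-dimensional numbers. Everything in it is an assumption chosen for illustration and is not taken from the paper or from our own tooling: the function name `train_toy_gan`, the target distribution (a Gaussian with mean 4), the linear generator and the logistic discriminator. It only demonstrates the alternating updates.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def train_toy_gan(steps=2000, batch=32, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on 1-D toy data.

    Hypothetical toy setup (not from the source article):
      real data     x ~ N(4, 1)
      generator     G(z) = a*z + b, with z ~ N(0, 1)
      discriminator D(x) = sigmoid(w*x + c)
    """
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator parameters
    w, c = 0.1, 0.0  # discriminator parameters

    for _ in range(steps):
        # Discriminator step: minimise -log D(real) - log(1 - D(fake)).
        real = [rng.gauss(4.0, 1.0) for _ in range(batch)]
        fake = [a * rng.gauss(0.0, 1.0) + b for _ in range(batch)]
        grad_w = grad_c = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            grad_w += (d - 1.0) * x
            grad_c += d - 1.0
        for x in fake:
            d = sigmoid(w * x + c)
            grad_w += d * x
            grad_c += d
        w -= lr * grad_w / batch
        c -= lr * grad_c / batch

        # Generator step: minimise -log D(G(z)) (the non-saturating loss).
        grad_a = grad_b = 0.0
        for _ in range(batch):
            z = rng.gauss(0.0, 1.0)
            d = sigmoid(w * (a * z + b) + c)
            grad_a += (d - 1.0) * w * z
            grad_b += (d - 1.0) * w
        a -= lr * grad_a / batch
        b -= lr * grad_b / batch

    return a, b, w, c
```

Calling `train_toy_gan()` returns the final generator and discriminator parameters. The value `b` is the mean of the generated samples, which the generator updates push towards whatever the discriminator currently scores as real. With models this simple, training can oscillate rather than settle, so treat it as a picture of the mechanics and not as a recipe.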

While not a huge amount of work has been done in the private sector on harnessing and weaponising GANs for malware development, the Chinese Academy of Sciences did release a technical paper in 2017, with accompanying code here. At Hunnic Cyber we developed our own in-house GAN for Malware Dev in 2020, but a series of events led to that firm shuttering, so we could not release the tool publicly.

This post, in combination with my upcoming post on attacking Microsoft ATP, presents a research idea for ambitious hackers & ML engineers: developing a Generative Adversarial Network along the lines of the technical signposts below.

Technical Signposts

While the technical paper released by the Chinese Academy of Sciences developed its GAN with a black-box model in mind, the offsec community has been acting as a human GAN against the ML models in ATP, SentinelOne, CrowdStrike and others for several years now. In my upcoming post I summarise some of the fantastic research done by other consultancies, researchers and developers, and present a Proof of Concept.

I am of the opinion that enough is now known about these EDR ML models for the security community to begin automating some of the development of their malware.

In my post on ATP I outline how Microsoft have clearly been training their models on certain assumptions & indicators related to malware, so it makes sense to turn this approach on its head.

I believe it is possible to make much clearer assumptions about how to develop malware with a GAN than it was in 2020, and I believe a researcher can apply these concepts without requiring vast computational overhead.

While Atlan Digital is dedicating time to redeveloping our own GAN for Malware Dev, I will outline a few of the key steps that ambitious researchers may find useful to get going:

You will need to develop your own custom lab to be able to interface with Microsoft ATP. I have outlined in my blog here how you can get going for free with a lab that you can set up.
AMSI has been our adversary in malware development until now, and has since been baked into the Common Language Runtime and other Windows DLLs. For developing your GAN, however, AMSI is now your friend: it acts as the discriminator, providing the feedback used to further train your GAN to become more effective. Every malware sample that you generate can now be queried against the AMSI API, and you can quickly build a feedback mechanism that enables your GAN to be fine-tuned.
Developing your own limited .NET compiler will allow you to generate samples with a limited set of recurring code (for example the shellcode and the shellcode execution elements) and a vast set of template non-malware code that would comprise much of the additional code within your application.
Since 2020 there has been progress across all tooling, so you may find that the code completion offered by Microsoft can assist you in your adventures!
While the Chinese Academy of Sciences' approach was to take existing malware samples, as I have outlined, it appears that Microsoft have made some logical assumptions that are easier to attack. Therefore, for the training set, we seek to train our model to develop malware based on the characteristics of NON-MALWARE samples, compiling with ample junk code, real method & function names and so on. Dynamic evasion is much more complex, but our later posts will follow up on this and how we approached that area.

I will leave a few links below that could help you along your journey:

USEFUL LINKS

Generative Adversarial Networks Specialisation (Andrew Ng) - https://www.deeplearning.ai/courses/generative-adversarial-networks-gans-specialization/

Anti-Malware Scan Interface - https://docs.microsoft.com/en-us/windows/win32/amsi/antimalware-scan-interface-portal

Visual Studio IntelliCode - https://visualstudio.microsoft.com/services/intellicode/

IntelliCode API Usage Examples - https://www.vsixhub.com/vsix/80700/

Create a Language Compiler for the .NET framework - https://docs.microsoft.com/en-us/archive/msdn-magazine/2008/february/create-a-language-compiler-for-the-net-framework-using-csharp

Learn ML.NET - https://dotnet.microsoft.com/en-us/learn/ml-dotnet

Anaconda - https://www.anaconda.com/

Synthetic Time-Series Data: A GAN approach - https://towardsdatascience.com/synthetic-time-series-data-a-gan-approach-869a984f2239 (quote from article below)

What’s new about TimeGAN?

Different from other GAN architectures for sequential data, the proposed framework is able to generate it’s training to handle a mixed-data setting, where both static (attributes) and sequential data (features) are able to be generated at the same time.

Less sensitive to hyper parameters changes

A more stable training process, when compared to other architectures.
