Everything you always wanted to know about malware detection, but were afraid to ask.
How is malware detected?
This is an example of a simple question whose answer spawns several sub-questions, each with its own answer, before we can consider the original question answered.
First, we need to figure out in what ways we might encounter or ‘see’ malware during its travels through our networks and systems. And to make things a bit simpler, we stretch the question somewhat so as to include detecting malware presence by behaviour as well.
– Ok, so where can we find or ‘see’ malware or its behaviour?
In the data connections; malware needs to travel to our systems and it has to do so either via the network cables, via cascading infections of storage media and short-range links like USB and Bluetooth, or via exotic methods like speakers and microphones.
No matter which method, the fact remains that to this day there has been no recorded instance of malware abiogenesis.
If we were to look very closely, and with all of our attention, at all these ‘transport methods’, we might be able to see the malware ‘jump’.
Also, if the malware is designed to do anything other than just incapacitate its victim, we might be able to see aberrant data on these transport methods.
In the storage sub-systems of the victim systems. Unless the malware authors are so confident that the victim never scrubs its memory (by active cleaning, but also by rebooting) or that the victim can be reinfected at a moment’s notice, the malware needs a place to persist, and possibly to store some intermediate data (like where it has already been, passwords it has snooped from the network, etc…).
Since most operating systems still allow raw access to storage devices there are quite a lot of ways to hide, but most malware does not bother with extreme sophistication. Instead a lot of malware depends on the fact that most systems have thousands of files right after installation and that a lot of filenames of valid system files are cryptic to begin with.
The malware then chooses a writable directory and puts itself and its files there under names that will not quickly arouse suspicion.
It may also modify legacy files that are almost never used or use databases that are running on the system.
In the hardware. Hardware is becoming cheaper, smaller and more complex. These three conditions make it harder to verify hardware designs before they are mass-produced. Malware can be programmed into the microcode using a hardware description language like Verilog, but it has also been shown that malware can be added in the final stage, via the lithographic masks, just before the chips are produced.
And also in the BIOS/UEFI… see the example of BadBIOS.
– Right, we have defined the arena, but besides malware there is lots of other stuff going on in our systems, so how do we detect the malware?
Well, we are not yet at the point where we can answer that question. We need to decide what to watch and when. This might severely impact the performance of our systems on the one hand and the accuracy on the other.
It is very tempting to just read all files on every machine and open up every single network packet so as to assemble a complete view of what goes on inside your systems and networks, but it will not work very well, if at all.
First there’s the immense can of worms called ‘privacy’, followed by ‘information segregation’ and ‘security’. After all, if you can read everything, then so could a potential attacker.
Next we have to consider storage space and computing speed; having a data-center thrice the size of your normal operations just to check for malware is not very cost effective (but might work if money is not a problem).
And finally we have to acknowledge that end-to-end encryption will put a veil over much of what we possibly could see.
– So, how do we decide what to watch?
We need to figure out what the minimum of information is that enables us to discriminate between bad activities and normal everyday activities and what actually constitutes bad activities.
There are a few stages in the life-cycle of malware that we could detect (some put it at the nice number of 5 stages, others at 8, but it all comes down to granularity), most notably the infection stage and what I would call the ROI-stage.
The infection stage can occur via so many different attack vectors, which can be fragmented across software, hardware or human interaction, that it is quite hard to do really good detection. Some anti-virus vendors have become good at this, but it requires a tremendous effort to keep the detection at a level that is practical and affordable.
The Return-On-Investment stage (ROI-stage) is a bit different; it needs to be reliable and consistent to make the effort of setting up the whole infrastructure and cloaking it (after all, malware is illegal) worthwhile. It also needs to use the victim’s infrastructure and try to evade any security measures. So this is bound to be less diverse, and it has other constraints as well (a zipped file of 2 Gigabytes is still 2 Gigabytes that you have to transfer over the network somehow).
To make an analogy; there are lots of ways burglars could gain entry into a house, but there are very few ways they can easily get the 52″ television and the desktop computers into their transport.
For RedSocks it therefore makes sense to mainly check outgoing (egress) meta-data of traffic. This has the added bonus that encryption plays much less of a role since the meta-data of a flow can tell us a lot about the kind of data inside the packets without ever having to know what is exactly inside the packets.
And processing meta-data is much more efficient than processing all the data packets themselves. Again, an analogy: it is easier to understand that someone has gone from A to B at time T than it is to reconstruct that information by understanding that this person has placed her left foot before the right foot from point A, then the right foot, then the left foot, then the right, etc… all the way until we get the left foot at B.
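To make the idea concrete, here is a minimal sketch of the kind of egress flow meta-data one could work with. The record fields are illustrative, not any particular vendor’s schema:

```python
from dataclasses import dataclass

# Hypothetical flow record: egress meta-data only, no packet payloads.
# Field names are invented for this example.
@dataclass
class FlowRecord:
    src_ip: str
    dst_ip: str
    dst_port: int
    bytes_out: int
    start_time: float   # Unix timestamp
    duration: float     # seconds

def describe(flow: FlowRecord) -> str:
    """Summarise a flow without ever inspecting what is inside the packets."""
    return (f"{flow.src_ip} -> {flow.dst_ip}:{flow.dst_port}, "
            f"{flow.bytes_out} bytes in {flow.duration:.1f}s")

flow = FlowRecord("10.0.0.5", "203.0.113.7", 443, 2_000_000, 1.7e9, 42.0)
print(describe(flow))  # 10.0.0.5 -> 203.0.113.7:443, 2000000 bytes in 42.0s
```

Even this handful of fields already hints at interesting questions: why is an office machine pushing 2 MB to an unfamiliar address, and how often?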
– Cool, now we can detect malware! How do we do that?
Let’s look at some methods in order of accuracy (depending, of course, on the implementation).
One way is to use something called blacklisting (usually combined with white-list and grey-list for better control over granularity). This is the fastest and most accurate method of discriminating good behavior from bad behavior since it involves relatively simple lookups for each item of meta-data that needs to be checked. The real hard work, and where the intelligence is needed, is to compile an accurate and up-to-date list.
This, in combination with the limited amount of meta-data makes for a very accurate and fast method to detect malware.
A drawback of this method is that it relies on a ‘closed-world assumption’: anything which is not on the blacklist is considered not ‘bad’, and therefore this methodology might miss behaviour which has not been sufficiently observed yet.
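As an illustration of why list-based checks are so fast, classification boils down to cheap set lookups. The addresses and list contents below are made up:

```python
# Minimal sketch of black/white/grey-listing on flow destinations.
BLACKLIST = {"203.0.113.7"}      # e.g. known command-and-control hosts
WHITELIST = {"192.0.2.10"}       # e.g. known-good internal services
GREYLIST  = {"198.51.100.3"}     # suspicious, needs further analysis

def classify(dst_ip: str) -> str:
    """O(1) set membership tests per item of meta-data."""
    if dst_ip in WHITELIST:
        return "good"
    if dst_ip in BLACKLIST:
        return "bad"
    if dst_ip in GREYLIST:
        return "suspicious"
    # Closed-world assumption: anything unknown is implicitly treated as good.
    return "unknown"

print(classify("203.0.113.7"))   # bad
print(classify("8.8.8.8"))       # unknown
```

The last branch is precisely the drawback mentioned above: everything not yet observed falls through as ‘unknown’ and raises no alarm.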
Another way is using heuristics. The term heuristic comes ultimately from the Greek heuriskein which means ‘to discover’. Heuristic techniques are so called ‘trial-and-error methods’ because they work on assumptions and therefore are not guaranteed to give only correct results.
An oversimplified example of a heuristic approach is ‘All traffic that comes from an office machine after 0:00 is evil’. This will of course fail if someone has forgotten to close some programs like email programs or chat programs.
Heuristics can work very well and quite fast with meta-data, but their flexibility is limited.
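The midnight heuristic from the example could be sketched like this; the cut-off hours are assumptions for illustration:

```python
from datetime import datetime

# Oversimplified heuristic from the text: office traffic in the small
# hours is flagged. The 00:00-06:00 window is an invented threshold.
def after_hours(flow_time: datetime, start_hour: int = 0, end_hour: int = 6) -> bool:
    """Flag traffic between start_hour and end_hour as suspicious."""
    return start_hour <= flow_time.hour < end_hour

print(after_hours(datetime(2016, 3, 1, 3, 15)))   # True  -> flagged
print(after_hours(datetime(2016, 3, 1, 14, 0)))   # False -> normal
```

And here the limitation is visible: a forgotten email or chat client polling at 03:15 triggers exactly the same alert as real exfiltration.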
Yet another method is statistics/anomaly detection. An anomaly is something that deviates from the common rule; an irregularity. Using statistics we can give probabilities to whether or not a given traffic pattern is likely to be part of the normal activities or not.
To find anomalies, we have to decide on some specific parameters of network traffic that we are going to watch and analyze. Then we need to create a ‘baseline’, or ‘learn’ the probabilities of all of these parameters occurring in the sequence that they are occurring. Basically we count all the occurrences of each pattern and count the occurrences of what came before this pattern.
In practice this means letting the system just run for some time, recording network traffic as it happens.
Although this method is very flexible and, when properly implemented, can adjust itself slowly as the behavior on the network changes so as to give few false positives, it does have a few drawbacks:
An alert that is generated can only tell you that something is anomalous, not whether it is bad anomalous or just ‘never seen before’ anomalous. There is also a potentially dangerous drawback: the baseline might include malware that was installed before the learning phase commenced.
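The ‘count each pattern and what preceded it’ baseline described above can be sketched as a first-order transition model. The event labels below are invented stand-ins for real traffic features:

```python
from collections import Counter, defaultdict

# Learn a baseline: count each observed pattern and, for each pattern,
# what followed it (a first-order transition model).
def learn_baseline(events):
    counts = Counter(events)
    transitions = defaultdict(Counter)
    for prev, cur in zip(events, events[1:]):
        transitions[prev][cur] += 1
    return counts, transitions

def transition_prob(transitions, prev, cur):
    """Estimated probability of seeing 'cur' right after 'prev'."""
    total = sum(transitions[prev].values())
    return transitions[prev][cur] / total if total else 0.0

# Invented sequence of traffic events recorded during the learning phase.
baseline = ["dns", "http", "dns", "http", "dns", "smtp"]
counts, trans = learn_baseline(baseline)
print(transition_prob(trans, "dns", "http"))  # ~0.667: 2 of 3 'dns' events were followed by 'http'
```

A sequence whose transitions all have very low baseline probability would then be flagged as anomalous; and, as noted, if malware traffic was present in the learning data, its transitions are baked into the baseline as ‘normal’.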
And finally we can let the computers figure it out themselves by using machine learning. Machine learning is basically when programs learn from data without being explicitly programmed. The opposite of the first method.
Machine learning is usually (though not always) implemented in a manner that uses large formulas (sometimes called neural nets) in which a lot of parameters are manipulated on the basis of the input and output. If the output (or prediction) is not correct, then the parameters in the formula get tweaked a little bit. Then the next input is taken and the output of the formula is checked again. You can visually have a go yourself using the TensorFlow Playground.
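As a toy illustration of ‘tweak the parameters when the output is wrong’, here is a single-neuron perceptron trained on invented flow features; everything about the data and thresholds is made up for the example:

```python
# Toy perceptron: the simplest 'formula with tweakable parameters'.
def train(samples, labels, epochs=20, lr=0.1):
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for x, label in zip(samples, labels):
            # Prediction: 1 if the weighted sum crosses the threshold.
            pred = 1 if x[0] * weights[0] + x[1] * weights[1] + bias > 0 else 0
            error = label - pred
            # If wrong, nudge each parameter a little toward the right answer.
            weights[0] += lr * error * x[0]
            weights[1] += lr * error * x[1]
            bias += lr * error
    return weights, bias

# Two invented flow features, e.g. (bytes_out_MB, connections_per_min).
samples = [(0.1, 1), (0.2, 2), (5.0, 30), (6.0, 25)]
labels  = [0, 0, 1, 1]   # 0 = benign, 1 = malicious
w, b = train(samples, labels)
pred = 1 if 5.5 * w[0] + 28 * w[1] + b > 0 else 0
print(pred)  # 1 -> the unseen flow is classified as malicious
```

Real systems use far larger formulas with thousands or millions of parameters, but the loop is the same: predict, compare, tweak, repeat.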
There are many pitfalls in using machine learning; at the moment it still needs a lot of computational power to make any form of accurate prediction, and the quality of supervised training sets is still a big issue, but it can already help malware analysts in lab conditions.