
Introduction

Attackers are adding new and sophisticated command-and-control (C2) capabilities to their malware using common, widely available C2 framework tools like Cobalt Strike, Brute Ratel, Mythic, Metasploit, Sliver, and Merlin; these capabilities easily evade static defenses based on IPS signatures or IP/domain/URL block lists. The tools provide post-exploitation capabilities, including command-and-control, privilege escalation, and actions on host, and were originally designed for penetration testing and red team operations.

However, attackers have hijacked and embedded these same toolkits for malicious purposes: many, such as Mythic and Merlin, are open source, while commercial products, such as Cobalt Strike and Brute Ratel, have been obtained through hacked copies or leaked source code. This has effectively turned these tools into adversarial C2 frameworks for malicious post-exploitation.

The tools can easily shape and change many parameters of C2 communications, enabling malware to evade current defenses more easily and for longer periods, and to cause greater damage within victim networks: stealing more data, discovering more valuable data, disrupting business apps/services, and maintaining hidden access to networks for future attacks.

Current approaches to detecting malware that uses C2 frameworks rely on static signatures and indicators: detection of implant executables, IPS signatures for C2 traffic, and IP/URL filters. These are inadequate for dealing with the dynamic, malleable profiles of the widely available C2 framework tools.

A new approach is required that is not rigidly tied to known attacks, but is instead based on anomaly detection over a comprehensive set of signals fed into trained machine-learning models, with fine-grained tracking of device and user risk. This approach supplements existing defenses, and it can dramatically increase detection rates while keeping false positives low and future-proofing against the evolving C2 traffic patterns that these same C2 framework tools enable.

This paper discusses the gaps in current approaches and the increased efficacy from using a focused machine-learning approach with additional network signals and fine-grained risk metrics based on models at the user and organization level. We also discuss some of the key challenges in testing the efficacy of any C2 beacon detection solution.

 

Adversarial C2 Frameworks

Cobalt Strike, Metasploit, Mythic, and Brute Ratel are some of the commercial and open-source adversary simulation tools originally designed for red team testing of malware detection. These toolkits are sometimes referred to as threat emulation tools or C2 frameworks as they provide a rich feature set (Gill) for simulating real threat activity during red team operations with a focus on the post-exploit command-and-control parts of the attack chain.

We may use some of these terms interchangeably throughout the paper but will generally use C2 frameworks to emphasize that these tools are being used by malicious actors to impact production environments and that the problem to solve is much more than simulations or emulations by friendly internal red teams.

These C2 framework tools have been embedded, hacked, or stolen and used by numerous attackers (“Cobalt Strike: International law enforcement operation tackles illegal uses of ‘Swiss army knife’ pentesting tool”), including nation-state actors such as Russia’s APT29 in SolarWinds (“SolarWinds Supply Chain Attack Uses SUNBURST Backdoor”) and the PRC’s TA415 (Larson and Blackford) to enhance and evolve the stealthy communication capabilities of various RATs, botnets, and C2-enabled malware.

Cobalt Strike is the most popular C2 framework tool, and we use it as a specific example throughout this paper, though the observations apply to all similar tools. The following Cobalt Strike high-level architecture diagram shows its basic components (Rahman) and the runtime attack flow.

Figure 1: Cobalt Strike High-Level Architecture

 

# | Attack Step | Description
1 | Initial access / infection | Initial infection vector, including the downloader and loader for the beacon payload.
2 | Call home (C2) | The Beacon calls home to the Team Server, typically over HTTP/HTTPS/DNS. May use domain/IP obfuscation via redirectors such as proxies, domain fronting (e.g., CDNs), or domain masquerading. Beacons may also chain communications to bypass internal network segmentation.
3 | Attacker command and control | The attacker controls the Beacon, issuing various commands. May use Aggressor Scripts to automate and optimize the workflow.
4 | Execute commands | The Beacon may use Execute Assembly (.NET executables) in a separate process or Beacon Object Files (BOFs) within the Beacon session/process, extending post-exploit capabilities. Memory injection is used to evade endpoint defenses focused on files and disk activity associated with malicious files.
5 | Actions on host | Numerous built-in actions are provided, with new capabilities added via extensions as BOFs or Execute Assembly.

Table 1: Attack Chain using Cobalt Strike C2 Framework

 

Cobalt Strike and similar toolkits make HTTP/S traffic widely and easily configurable, producing C2 traffic that often appears benign: it looks like normal web traffic and resembles browser or popular application traffic. Default configurations ship with the tools that emulate both known malware and known valid applications.

Although DNS is also supported as a C2 protocol, we focus the discussion on HTTP/S C2 because it reflects the bulk of network traffic in and out of an organization, is more complex due to the variety of applications using HTTP/S, and attracts the majority of malicious actors, who try to hide amidst the network noise, including legitimate benign C2 beacons.

The toolkits are highly configurable (via malleable profiles) and can easily vary the timing, frequency, volume, application protocols, destination IPs/domains, user agents, HTTP headers, HTTP verbs, URIs, parameters, SSL/TLS certificates, beaconing delay with random jitter, and payload/content. C2 framework tools also allow for a large number of post-exploitation actions, which are encrypted, downloaded, and run in-memory, making post-compromise activity very difficult to detect on endpoints.

We will focus on the specific C2 communications capabilities of the C2 framework tools (e.g., C2 beaconing), how easily those communications are changed (e.g., via Cobalt Strike's C2 Malleable Profiles), and the challenges posed to organizations trying to detect stealthy malware.

There are multiple good resources discussing the functionality of Cobalt Strike’s Malleable Profiles (Gill), but we’ll point out some of the commonly used features. Here is a snippet of the malleable profile for mimicking the Gmail browser application in Cobalt Strike (Mudge):

Figure 2: C2 Malleable Profile (gmail)

 

Some of the key functionality and areas of the profile are:

Section: https-certificate

# Use an existing certificate or generate a self-signed certificate, as seen in
# this example.

Section: global options

# These global options set the C2 beacon sleep time to 60 seconds with a
# random jitter of +/- 15%, showing the ability to vary the call-home timing to
# avoid easy detection.
set sleeptime "60000";
set jitter "15";

# Other global options specify on-host post-exploit action parameters, such as
# the process name spawned to execute commands using in-memory injection or the
# pipename used for IPC communications. These are not relevant to C2.
set pipename "interprocess_##";
set spawnto "userinit.exe";

Section: http-get

# The uri path used for beacon->server communications can be varied with a list.
set uri "/_/scs/mail-static/_/js/";

# Client (beacon->server) communications, including cookies, headers, and
# encoding, can all be specified and varied easily at the HTTP protocol level.
client {
    metadata {}
    header {}
}

# Similarly, server->beacon communications can also be varied at the HTTP
# protocol level.
server {
    header {}
}

# Cobalt Strike allows shaping of the 2-way communications flow between the
# Beacon client and the C2 Team Server ("A Beacon HTTP Transaction Walk-through"):
# 0. http-stager {}  optional stager to download the full Beacon
# 1. http-get {client}   client -- call home → server
# 2. http-get {server}   server -- cmds → client
# 3. http-post {client}  client -- cmd output → server
# 4. http-get {server}   server -- confirm → client

Table 2: C2 Malleable Profile Description (gmail)

 

As can be seen above, simple modifications of these profiles can easily change C2 communications behavior to mimic common applications, their beacons, and general web traffic. There are more than 240 public malleable profiles for Cobalt Strike alone, readily available for use or easy modification.
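To illustrate how cheaply variants are minted, the following Python sketch (ours, purely illustrative; every list value below is hypothetical) templates a minimal profile-style snippet from a handful of interchangeable fields. Even this toy space yields dozens of distinct C2 behaviors:

import random

uris = ["/_/scs/mail-static/_/js/", "/api/v2/status", "/cdn/assets/app.js"]
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1)",
]
sleep_seconds = [30, 60, 300, 3600]
jitter_pct = [10, 15, 25, 40]

def render_profile(uri: str, ua: str, sleep_s: int, jitter: int) -> str:
    # Emit a minimal malleable-profile-style snippet with the chosen values.
    return (
        f'set sleeptime "{sleep_s * 1000}";\n'
        f'set jitter "{jitter}";\n'
        "http-get {\n"
        f'    set uri "{uri}";\n'
        "    client {\n"
        f'        header "User-Agent" "{ua}";\n'
        "    }\n"
        "}\n"
    )

print(render_profile(random.choice(uris), random.choice(user_agents),
                     random.choice(sleep_seconds), random.choice(jitter_pct)))
print(len(uris) * len(user_agents) * len(sleep_seconds) * len(jitter_pct),
      "distinct variants from just these four fields")

Every combination is a fresh traffic shape that a signature written against one specific profile will miss.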

 

Current Detection Approaches

Current approaches for detecting malicious C2 traffic tend to match hard-coded byte signatures, use regular expressions against payloads or headers (IPS signatures), or match against IP/domain/URL lists. These approaches are static and easily evaded by the dynamic, configurable nature of the C2 framework toolkits being embedded by attackers.

IPS Signatures

To illustrate the challenges with IPS solutions, here is one of the Snort rules to detect the Zeus Trojan (Snort):

Figure 3: Snort Rule (Zeus Trojan)

 

Snort and many IPS solutions allow various matches of content or headers at layers 3 and 4, as well as at the application level as indicated by the action verbs in the rule. Many matches such as the content rule option are static byte/character matches, while the pcre rule option is a regular expression match.

When looking side by side at the adversary side (the C2 Malleable Profile for gmail seen previously) and the defensive side (the Zeus Snort rule), the fragility of static, hard-coded matching is clear. Imagine an attacker has deployed a new Zeus variant built on Cobalt Strike, and a Snort IPS with the above Zeus rule effectively detects it. The attacker could change a single character in the profile, such as adding a space inside MSIE so that content:"|3B 20|MSIE|20|"; no longer matches, and the malware would evade the IPS signature.

While there is some contextual awareness and state tracking, the IPS signature approach is inherently limited by its static matching, which results in false negatives and easy evasion (changing literally one character in one field can bypass a rule).
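To make the fragility concrete, here is a minimal Python sketch (ours, purely illustrative; the traffic strings are hypothetical) of the static byte matching an IPS performs, and how the one-character change described above defeats it:

SIGNATURE = b"; MSIE "  # byte-for-byte analogue of content:"|3B 20|MSIE|20|"

def ips_match(http_headers: bytes) -> bool:
    # Static byte match: exactly what content-based IPS rules do.
    return SIGNATURE in http_headers

original = b"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1)"
evaded   = b"User-Agent: Mozilla/4.0 (compatible; MS IE 7.0; Windows NT 6.1)"

print(ips_match(original))  # True:  the rule fires on the stock profile
print(ips_match(evaded))    # False: one inserted space, same malware, no detection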

This is not to say that IPS solutions aren't useful. IPS signatures should be retained: they serve as a useful perimeter defense, blocking many known network exploits quickly and efficiently. Even if an IPS achieves only a 60% detection rate, that 60% can be blocked or alerted on cheaply, avoiding costly downstream processing.

IP/URL Block Lists

Other traditional approaches, such as the use of block lists (IP or URLs), are often applied in an effort to prevent the initial access or download of malware during web browsing, as well as to block potential C2 traffic.

A common challenge with block lists is that they are often out of date, causing false positives, and they are reactive: they are updated only after the compromise of target #1, or patient zero.

This is exacerbated by IP/domain indirection techniques used to hide the domain or IP address of the C2 server. Cobalt Strike supports redirectors, which can be as straightforward as IP proxies, to obfuscate the true domain or IP of the C2 server. Other techniques, such as domain fronting using CDNs or domain masquerading, take advantage of mismatches between the TLS SNI and the HTTPS Host header to hide the final malicious domain from some URL security filters.

Network Traffic Heuristics

A different approach involves the use of heuristics, typically applied to network traffic patterns based on volume or time. The classic example is to detect regular outbound communication (e.g., every 60 minutes), perhaps to an IP address with no registered DNS A record.

To evade detection, C2 framework toolkits allow easy configuration of a random factor in the beaconing delay by use of the jitter setting in a Cobalt Strike Malleable Profile:

Figure 4: C2 Malleable Profile Settings (beaconing timing)

 

These settings specify a call-home interval of 60 seconds +/- 15%, which means the actual interval will range from 51 to 69 seconds, evading simple checks for recurring beaconing at constant intervals.
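The effect is easy to demonstrate. The following Python sketch (our simulation, not drawn from any product) generates jittered beacon intervals the way a malleable profile configures them, showing that a constant-interval heuristic never fires, while a distribution-aware statistic still hints at machine-generated timing:

import random
import statistics

def beacon_intervals(base: float = 60.0, jitter: float = 0.15, n: int = 50) -> list:
    # Each sleep is drawn from base +/- jitter, as the profile above configures.
    return [base * random.uniform(1 - jitter, 1 + jitter) for _ in range(n)]

intervals = beacon_intervals()

# Naive heuristic: fire only if the interval is essentially constant.
is_constant = max(intervals) - min(intervals) < 1.0
print("constant-interval rule fires:", is_constant)  # False: 51-69s spread evades it

# A distribution-aware check: machine timing still shows a low coefficient of
# variation compared with bursty human/browser traffic.
cv = statistics.stdev(intervals) / statistics.mean(intervals)
print(f"coefficient of variation: {cv:.2f}")  # ~0.09 for this beacon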

Efficacy

The problem with current approaches is that they do not effectively detect malleable C2 communications and are easily evaded even when specifically tuned. They serve a purpose in efficiently detecting static attack techniques with well-known indicators, but they miss more dynamic or sophisticated attacks or create a large number of false positives.

As one data point, when testing the most common Cobalt Strike C2 Malleable Profiles from public repos, out-of-the-box IPS solutions such as Snort and Suricata detected substantially less than 20% of the C2 communications from the most common C2 framework toolkits.

Even after specifically adding rules to match as many of the public profiles as possible, optimizing for this specific test, coverage could only be increased to roughly 60% without introducing significant false positives that would be very problematic in a production environment.

The efficacy problems are numerous: not only higher false positives, but a resulting configuration rigidly built for the specific test in question and easily evaded by slight tweaks to the profiles. And at the end of the day, roughly 40% of the profiles remain undetected, a very high false negative rate, before counting the additional false negatives from a determined attacker who customizes C2 profiles to mimic well-known applications in a slightly different way.

New Detection Approach

A more effective approach is required: one based not purely on static indicators, but on focused machine-learning models that detect anomalies in network traffic using a multitude of signals indicating suspicious command-and-control activity, compared with what valid applications normally do for the specific users within the specific organization. Additionally, fine-grained risk metrics should be tracked at the user level to enable the most accurate and effective mitigation actions. Innovations in three areas are required to make large improvements in detecting stealthy C2 beaconing from the C2 framework tools:

Figure 5: New Approach C2 Beacon Detection

 

Comprehensive Signals

A comprehensive set of signals is required, including source, destination, and traffic characteristics: SSL/TLS certificates on both the source (malware inside the environment) and destination (C2 server), domain/IP/URL, source attributes such as user agent and process characteristics, traffic size/burstiness/patterns, and HTTP headers/payload/URI, to name just a few.

When looking at various signals across time, volume, network layers, and overall traffic profiling, behavioral detection can provide a general and effective mechanism for detecting the latest malware via suspicious and malicious C2 beaconing activity.

Figure 6: Comprehensive Signals

 

There are several dimensions to the signal types:

  • Network flow: source and destination attributes, as well as traffic patterns
  • Network layers: different signals from layer 3 to 7 (anomalies across TCP/IP headers, SSL/TLS fingerprints, HTTP headers/payloads, and application-level content)
  • Time: frequency, anomalous timing patterns to detect infrequent and slow activity
  • Data: content and volume (anomalous packet sizes, bursts, cumulative stats)

Additionally, there are several signal types (a sketch of extracting such signals follows this list):

  • Traffic pattern-based (volume, timing, content) including repeated beaconing in conjunction with unusual user agent or domain.
  • Heuristics (e.g., suspect registrars or known malicious SSL fingerprints)
  • Anomalies (unusual domain, user agents, or SSL fingerprints)
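As a concrete illustration of these dimensions, the following hypothetical Python sketch turns a single flow record into model-ready signals; the field names and thresholds are our own assumptions, not a product schema:

from dataclasses import dataclass

@dataclass
class Flow:
    dst_domain: str
    user_agent: str
    tls_fingerprint: str  # e.g., a JA3-style client fingerprint
    bytes_out: int
    bytes_in: int
    interval_cv: float    # coefficient of variation of inter-request timing

def extract_signals(flow: Flow, known_domains: set,
                    known_uas: set, known_tls: set) -> dict:
    # Each signal is weak alone; the model consumes them jointly.
    return {
        "new_domain":     flow.dst_domain not in known_domains,       # destination
        "new_user_agent": flow.user_agent not in known_uas,           # source
        "new_tls_fp":     flow.tls_fingerprint not in known_tls,      # layers 5/6
        "upload_heavy":   flow.bytes_out > 5 * max(flow.bytes_in, 1), # data shape
        "machine_timing": flow.interval_cv < 0.2,                     # time
    }

flow = Flow("cdn-updates.example", "Mozilla/4.0 (compatible; MSIE 7.0)",
            "771,4865-4866", 48_000, 2_100, 0.09)
print(extract_signals(flow, {"mail.google.com"},
                      {"Mozilla/5.0 (Windows NT 10.0)"}, {"771,4867"}))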

One important point: some of the above signals are already part of current approaches and existing solutions. This reinforces that a particular signal, such as a traffic spike (large volume), is neither good nor bad, effective nor ineffective, by itself. The context and processing of the signal is the determining factor. Used at a network perimeter to block/allow traffic, a signal prone to false positives can cause severe operational problems. Fed into an anomaly detection system that incorporates it into a granular risk metric (discussed below) and a well-trained model, the same signal can be extremely effective at detecting new threats robustly and with low false positives.

Anomaly Detection

Effective detection of C2 beaconing from C2 framework toolkits requires machine-learning models built on a more comprehensive range of signals, able to identify both today's C2 framework toolkits and future suspicious network behavior that could indicate C2 activity.

Figure 7: Anomaly Detection

 

Anomaly detection should be based on models at the user/device, role, and organization level. Anomalies presume a valid "normal" baseline of activity or behavior to compare against, and suspicious activity can surface against different baselines: a user's actions versus their own history, versus the organization's norm, or versus individuals in similar roles. Each scope has valid use cases the others do not cover, so a good approach incorporates multiple models with different scopes.
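One simple way to realize multiple scopes is to score the same observation against several reference baselines. The following Python sketch (our construction, with made-up numbers, not Netskope's implementation) compares a day's count of connections to never-before-seen domains against user, role, and organization baselines:

import statistics

def z_score(value: float, baseline: list) -> float:
    # How far today's value sits from a scope's historical norm.
    sd = statistics.stdev(baseline)
    return (value - statistics.mean(baseline)) / (sd if sd > 0 else 1e-9)

# Daily count of connections to never-before-seen domains (hypothetical data).
baselines = {
    "user": [2, 1, 3, 2, 2, 1, 2],   # this user's own history
    "role": [4, 3, 5, 4, 6, 3, 4],   # peers in similar job roles
    "org":  [5, 6, 4, 7, 5, 6, 5],   # the whole organization
}

today = 14
for scope, baseline in baselines.items():
    print(scope, round(z_score(today, baseline), 1))
# The same observation is wildly anomalous against the user's own history
# (z ~ 17) yet less extreme org-wide, which is why multiple scopes are needed.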

Training Data

Training data sets should include both malicious and benign traffic:

  • Malicious traffic can be simulated using general C2 testing tools, specific C2 adversarial beaconing tests based on publicly available configurations of C2 framework tools, as well as customized configurations from a red team perspective and official red team exercises.
  • Benign traffic or valid traffic is best gathered from a significant number of actual users at real organizations over enough time to normalize against user and organizational biases.

Training datasets are the flip side of testing datasets, and significant time should be spent analyzing and validating both. Some of the factors in constructing good testing datasets are discussed in a subsequent section.

Granular Risk Metrics

The output of the anomaly detection is critical. The best approach does not make simple block/allow or alert/silent determinations from a raw signal; instead it tracks and adjusts fine-grained risk metrics at the user, role, and organization level, which then drive remediation actions such as alerting, coaching, or blocking.

This approach to tracking and acting on risk is fundamentally different from what is typically used today. Perimeter defenses that block, alert on, or allow traffic are generally static and prone to high false positive rates. The net result is that these solutions are deployed with a conservative policy that blocks only certain, known risks, which opens up a large number of false negatives. With firewalls, we see false positive problems from overly aggressive block actions based on IP threat intelligence. With IPS solutions, we have discussed the false positive challenges of static signatures trying to detect highly configurable, dynamic C2 traffic.

But a false positive at the perimeter layer can be very useful as a signal to a more intelligent layer. In this scenario, we do not use it for a binary assessment (allow/block, alert/ignore) but as a fine-grained risk metric adjusted over time (e.g., a user risk score) with a tuned threshold before taking action. A granular risk metric, for example a risk score from 1000 (no risk) to 0 (extreme risk) on a user, device, or even IP address, lets us model the spectrum of gray in real-world threats, where assessments are rarely 100% malicious or 100% benign.

Conceptually, this is captured in the following illustration: three different signals might be detected, each prone to false positives by itself. When each is associated with incremental risk and evaluated by a tuned machine-learning model, the same signals accumulate risk over time and ultimately deliver a high-confidence anomaly detection.

Figure 8: Granular Risk Metrics

 

Note that the signals in this example need not be simple static signals. An "unusual domain, unrecognized user agent, and SSL/TLS certificate" could be an anomaly relative to the past "normal" baseline of that specific user, of users in similar job roles, or of the whole organization. A "suspect registrar" may be an amalgamation of domain reputation correlated over time. And "periodic beaconing" is no longer a simple match on a fixed rate or duration; it can instead detect abnormal but regular, repeated activity within a time window, similar to bot activity as opposed to valid application daemon callouts.

In practice, this lets us adjust a risk score incrementally and appropriately in response to a low-fidelity signal, triggering no blocking or alerting until the cumulative risk score crosses a tuned, high threshold. We can thus capture cases where many slightly risky, low-fidelity indicators combine with a higher-fidelity, higher-risk indicator for a particular user or device to generate a critical risk alert and action, with a drastically lower chance of false positives.
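The score mechanics can be sketched in a few lines of Python, using the 1000 (no risk) to 0 (extreme risk) scale described above; the per-signal deductions and the threshold are hypothetical tuning values, not Netskope's:

ALERT_THRESHOLD = 400   # tuned per deployment; alert when score drops below this

# Hypothetical per-signal risk deductions, weighted by signal fidelity.
DEDUCTIONS = {
    "unusual_domain_and_ua": 150,  # low fidelity: also common in benign traffic
    "suspect_registrar":     100,
    "periodic_beaconing":    400,  # higher-fidelity behavioral signal
}

def apply_signal(score: int, signal: str) -> int:
    # Deduct incremental risk; never drop below the floor of 0 (extreme risk).
    return max(score - DEDUCTIONS.get(signal, 0), 0)

score = 1000  # 1000 = no risk, 0 = extreme risk, per the scale above
for signal in ["unusual_domain_and_ua", "suspect_registrar", "periodic_beaconing"]:
    score = apply_signal(score, signal)
    action = "ALERT" if score < ALERT_THRESHOLD else "no action"
    print(f"{signal}: score={score} -> {action}")
# Only the accumulation of all three crosses the threshold (350 < 400);
# no single signal would have triggered an alert on its own.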

Evaluation and Testing

A new approach can be theoretically sound and fail miserably in practice; the proof quite often comes down to data and testing. Vendors providing solutions and organizations evaluating them require a robust approach to testing new threats. To achieve accurate results, it is essential to test with a diverse dataset that includes both malicious and benign traffic.

Benign Traffic
Benign traffic should be realistic, comprehensive, and similar to production in the number of users and their activity. Because good traffic is often user-dependent, it should be studied across a large sample of users and a reasonable time period. This test dataset measures false positive (FP) rates. The key variance across datasets lies in client signals (applications, user agents, and client SSL/TLS certificates), destination signals (destination domains and IP addresses), and traffic pattern signals (headers, payload, size, and timing).

The good news is that benign traffic is easily gathered from the day-to-day operations of the organization's users; the bad news is that it must be validated as benign. The practical approach is to statistically sample the benign traffic to a reasonable confidence factor, then spend the majority of time on the alerts from the C2 detection solution under test, verifying each as a true or false positive. In other words: sample and verify upfront to form a baseline, assume the benign dataset is clean, then identify false positives during testing.
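The upfront sampling step can be sized with the standard proportion sample-size formula; a minimal Python sketch, assuming we want 95% confidence in the contamination estimate for the benign set:

import math

def verification_sample_size(confidence_z: float = 1.96,
                             assumed_rate: float = 0.5,
                             margin: float = 0.03) -> int:
    # Classic n = z^2 * p(1-p) / e^2 for estimating a proportion;
    # p = 0.5 is the worst case and maximizes the required sample.
    return math.ceil((confidence_z ** 2 * assumed_rate * (1 - assumed_rate))
                     / margin ** 2)

print(verification_sample_size())             # 1068 flows to verify, +/-3% at 95%
print(verification_sample_size(margin=0.05))  # 385 flows if +/-5% is acceptable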

Figure 9: Benign Traffic Testing

 

Malicious Traffic
Using public profiles from popular C2 framework tools provides a solid foundation for testing bad traffic. These profiles represent practical, frequently used configurations that evade defenses, and they help measure false negative (FN) rates. However, considerable thought must go into building a representative "bad traffic" dataset, as there are multiple levels to the coverage and to what the datasets actually test, as illustrated in the following diagram:

Figure 10: Malicious Traffic Testing

 

  1. Breach and attack simulation tools such as SafeBreach are excellent for coverage testing and repeated testing. Their C2 test cases typically include at least some simulation of C2 framework activity. The advantage is the large breadth of functionality available, including general malware attacks, well-designed GUIs and architectures, and repeatable testing procedures and reports. These tools can cover a breadth of scenarios: low-and-slow activity, traffic to IaaS/CSP infrastructure, HTTP and non-HTTP traffic, SSL/HTTPS traffic, and spoofing of a variety of user agents.
  2. C2 framework tools (public profiles). In-depth testing of C2 frameworks requires focused work. One approach is to create a test dataset based on the public profiles of specific C2 framework tools, e.g., Cobalt Strike. These public malleable profiles tend to be shared widely and used by many operators and malicious actors, since they include useful emulations of benign applications such as Gmail. This approach typically provides more comprehensive testing of the specific C2 frameworks.
  3. C2 framework tools (custom profiles). Internal customization of the C2 Malleable Profiles provides even more realistic testing of C2 frameworks. These custom configurations can be built during internal red team operations, which requires more work and investment: red team operators must be fluent in the C2 framework tools.
  4. Realistic attacks. The most realistic testing involves black-box testing through external pen-testing or bug-bounty programs. In these scenarios, the requirements of the exercise are carefully constructed to require or incentivize actual POC exploits that use specific C2 framework tools, or any C2 beaconing behavior, with the caveat of avoiding detection over a period of time. The goal is not only to test initial access vectors, which is the norm, but to focus on post-breach activity: demonstrating the ability to install a backdoor payload with working C2 activity. This enriches the test dataset beyond the C2 frameworks, can exercise backdoor POC code with custom C2 communications, and is an excellent test of any detection tool's resilience to a skilled "attacker" using different or custom TTPs.

Testing can involve one or several of these approaches, but explicit choices should be made about how to create, gather, and validate the test datasets and how to measure expected outcomes. Creating and collecting the test datasets carefully is very important so that testing can be automated and easily repeated.

It is also crucial to measure complete metrics during testing: true and false positives, and true and false negatives. While collecting all metrics sounds obvious, being precise in definitions and clear and repeatable in measurement methodology is difficult, and imprecision leads to misleading results.
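Pinning the definitions down in code helps keep measurement repeatable. A minimal Python reference for computing the rates from a labeled test run (the toy counts below are hypothetical):

def rates(labels: list, alerts: list) -> dict:
    # labels: 1 = malicious sample, 0 = benign; alerts: 1 = solution alerted.
    tp = sum(1 for y, a in zip(labels, alerts) if y == 1 and a == 1)
    fp = sum(1 for y, a in zip(labels, alerts) if y == 0 and a == 1)
    tn = sum(1 for y, a in zip(labels, alerts) if y == 0 and a == 0)
    fn = sum(1 for y, a in zip(labels, alerts) if y == 1 and a == 0)
    return {
        "tpr": tp / max(tp + fn, 1),        # share of malicious traffic caught
        "fpr": fp / max(fp + tn, 1),        # share of benign traffic falsely flagged
        "precision": tp / max(tp + fp, 1),  # share of alerts that are real
    }

# Toy run: 10 malicious and 90 benign samples (hypothetical).
labels = [1] * 10 + [0] * 90
alerts = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 87
print(rates(labels, alerts))  # tpr 0.60, fpr ~0.03, precision ~0.67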

False Positive and False Negative Targets
With the new evasive threats created by C2 frameworks, newer detection solutions lack widely accepted FP and FN rates. It is nonetheless vital to set FP and FN targets. With known-quality test datasets, baselines can be established for the current environment and its users/devices, and reasonable targets can then be set relative to those baselines.

For example, suppose an organization with only an IPS is starting an evaluation of new C2 detection solutions, and it is unclear what FP/FN rates are acceptable. The organization can still set reasonable targets by following a testing methodology such as:

  • Create quality test data: benign traffic based on production data and malicious traffic based on, for example, public C2 malleable profiles for Cobalt Strike; validate samples of those datasets manually.
  • Create a clear, repeatable test methodology by defining testing and measurement tools.
  • Measure all metrics (TP/TN/FP/FN) during testing.
  • Test new solutions and compare metrics. For example, the IPS could be tuned specifically for better TP rates on the malicious traffic, but ensure that FP/TN/FN rates are also measured and validated. The efficacy of different solutions can then be properly evaluated, especially for total impact to the organization, as described in the Impact section below.
  • Test new datasets and compare. Customize the datasets to reflect reasonable adjustments an attacker might make. There are several ways to do this.
    • For example, when testing Cobalt Strike, its C2 Malleable Profiles can easily be modified to emulate benign apps slightly differently, or to emulate new benign apps used within the specific organization. This can be done by sniffing outbound HTTP/S traffic via a proxy.
    • Test not just one but multiple C2 framework tools, as they differ in capabilities and techniques. Using a different C2 framework tool is also a good change, as its C2 traffic shaping will differ.
    • Creating a custom test payload with its own hand-coded C2 communications is yet another way to vary the test datasets, but it requires the most time and investment.

Resilience Testing
By testing with new datasets across different solutions, we also gain valuable insight into the rigidity versus resilience of each solution. This paper has argued that hard-coded, signature-based approaches are not only less effective at detecting C2 frameworks but also rigid, resulting in high FP/FN rates and allowing easy bypass through simple attack changes, such as malleable profile edits.

Resilience can be tested by modifying the datasets within reason (i.e., staying within the same TTP category). In other words, we perform a realistic resilience test by changing the C2 communications within the malicious traffic dataset using C2 malleable profiles and monitoring the TP/TN/FP/FN rates. We see how coverage varies, and we learn what changes the detection solution needs in order to maintain coverage against particular TP/TN/FP/FN targets.

Retesting with changed datasets in this way is analogous to an attacker changing their TTPs. It evaluates the resilience and effectiveness of the new detection solution: can it still detect changes within the same threat technique category (C2 communications over HTTP/S)?

False Positive and False Negative Impact
Measuring FP/FN rates is good and allows for relative improvement, but we also need to measure, or at least estimate, the impact of FPs and FNs; otherwise it is impossible to evaluate the true usefulness of any detection solution. In other words, a 1% FP rate, or a 5% improvement in the FP rate, has no context unless we can translate that 1% or 5% into terms that make sense to security budget decision-makers.

Here are two approaches that can help translate TP/TN/FP/FN rates into more quantifiable impact:

  1. User impact over time: Take the absolute number of false positives corresponding to the FP rate and normalize it as a rate per user over time (see the sketch after this list). This often makes more sense than percentage rates or absolute numbers: rather than 1% FPs or 2,437 false positives, it may be easier to assess impact from 0.1 false positives per user per day. For a secure web gateway, someone in the organization could then judge whether a given FP target is acceptable based on user impact over time. For C2 framework-enabled malware, which results in breaches, user impact is better characterized as downtime or data loss per user over a time period: we have an N% chance of $X loss per user every year. These are rough estimates, but any start is useful, as it can be revised and improved with regular iteration. If impact is assessed in terms of users over time, it also becomes easy to evaluate detection or protection solutions, which are often priced per user per year.
  2. Security operations impact in terms of time, money, and breach probability: Beyond end-user impact, administrative impact should be assessed, particularly for operations staff who spend time handling detection alerts. Time spent responding to noisy alerts translates directly into FTE salary cost. Alert fatigue is an additional real impact, estimable in terms of efficacy (time to respond) and, more importantly, the loss of time and attention to truly higher-impact threats that go uninvestigated. This latter effect feeds into breach impact: breaches are more likely to happen when security operations have too many false positives to investigate and dismiss.
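The normalization in point 1 is simple arithmetic; a small Python sketch with hypothetical population numbers chosen to reproduce the 0.1-per-user-per-day example:

def fps_per_user_day(total_fps: int, users: int, days: int) -> float:
    # Normalize an absolute FP count into a per-user, per-day rate.
    return total_fps / (users * days)

# e.g., 2,437 false positives over a 30-day test across 800 users:
print(round(fps_per_user_day(2437, 800, 30), 2))  # ~0.1 FP per user per day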

An impact assessment is often the only way to surface crucial insights, such as the true cost of detection efficacy. An overly aggressive detection solution configured for low FNs and high FPs is useless and harmful, because security operations waste an inordinate amount of time responding to low-fidelity alerts instead of engaging in higher-leverage activities. Likewise, an overly conservative solution with low FPs but high FNs exposes the organization to high breach risk, which can be unacceptable from an overall risk assessment view.

Impact should be estimated and assessed at the same time as core TP/FP/TN/FN metrics.

Realistic Testing

Red team with humans, not just automated breach or pen-test tools. It is highly recommended to test C2 beaconing solutions not only in production environments with real users, but also under realistic adversarial scenarios such as pen tests or bug bounties. By adjusting bounty amounts and requirements to demand explicit implantation and successful post-exploitation actions using popular C2 framework toolkits, we make the "malicious traffic" real and measurable. This can be widened to any C2 beaconing activity, including custom code, to test the resilience of the detection solution; the testing requirement should include demonstration of successful daily beaconing and command execution over a week, without detection.

If an external pen test or bug bounty program is repeated, the differences in detection rates will be measurable and useful for evaluating efficacy and ROI.

With a rigorous approach to testing, not only is efficacy measured comprehensively, but ongoing targets and goals can be set relative to a current or historical baseline. And if the same tests and measurements are performed across multiple solutions, comparing performance and making buying and implementation decisions becomes trivial.

Design Considerations

Research and design involving these concepts and overall approach are discussed in more detail in: Security systems and methods for detecting malleable command and control (Mulugeta).

 

Benefits

Anomaly Detection of New Unknown Threats

This approach effectively mitigates unknown threats by leveraging machine learning models trained on application behavior unique to users within an organization. The granular user risk metric significantly reduces false positives.

In contrast, existing reactive approaches rely on identifying a first victim or patient zero (a sacrificial lamb for the greater good), followed by vendor analysis and research that can take days or even months before a new signature or rule is released to block the threat for customers not yet attacked. By design, this approach is ineffective against new and emerging malleable threats.

An anomaly detection approach, leveraging specific and tuned machine-learning models, can uniquely detect suspicious behavior without requiring an analyze-release-update cycle. The approach maintains robustness even as threat tactics evolve.

Comprehensive Signal Analysis

Anomaly detection across a comprehensive set of signals, including time, volume, TCP/IP communications, SSL/TLS fingerprints, and application protocol payloads, can effectively detect sophisticated malleable C2 communications.

Adversary Toolkit Detection

This approach can effectively detect the use of the latest C2 framework tools, as well as new and suspicious C2 beaconing activity, by relying on anomaly detection over a breadth of network signals specific to the users in the environment, compared against valid, benign traffic in that environment.

Detection Efficacy

Current approaches (IPS signatures plus IP/domain/URL blocks) miss a high proportion of the advanced C2 communications in the latest malware (40% to 80%, depending on the test scenario).

Using the new approach, with a tuned machine-learning model, anomaly detection over a rich set of signals, and a granular risk metric, 85-95% of these currently evaded attacks can be detected.

This results in an overall 95%+ true positive detection rate with minimal false positives.

 

Conclusion

C2 framework toolkits have empowered attackers with sophisticated techniques to evade command-and-control (C2) detection. Notably, widely-available toolkits like Cobalt Strike, Brute Ratel, and Mythic are accessible either as open-source or hacked/stolen commercial code.

Traditional static approaches, which heavily rely on static signatures and indicators such as IP/URL block lists, face severe limitations and are easily circumvented by these evolving threats.

To address this challenge, a fundamentally different approach is required, one that leverages machine-learning models. These models incorporate a comprehensive set of network signals and are trained at both the user and organizational levels. Additionally, they use fine-grained user risk metrics to lower false positives and to model the gray areas so often present in real threats.

The efficacy of machine-learning approaches should be carefully evaluated by users. Rigorous testing against a robust testbed of malicious and benign traffic is essential to determine their effectiveness in detecting and mitigating these new threats.

 

References

“Cobalt Strike: International law enforcement operation tackles illegal uses of ‘Swiss army knife’ pentesting tool.” The Record from Recorded Future News, 3 July 2024, https://therecord.media/cobalt-strike-law-enforcement-takedown. Accessed 23 August 2024.

Gill, Andy. “Understanding Cobalt Strike Profiles – Updated for Cobalt Strike 4.6.” ZSEC Blog, 13 April 2022, https://blog.zsec.uk/cobalt-strike-profiles/. Accessed 23 August 2024.

Larson, Selena, and Daniel Blackford. “Cobalt Strike: Favorite APT Tool to Crimeware.” Proofpoint, 29 June 2021, https://www.proofpoint.com/us/blog/threat-insight/cobalt-strike-favorite-tool-apt-crimeware.

Mudge, Raphael. "Malleable C2 Profiles: gmail." Malleable-C2-Profiles normal gmail.profile, rsmudge, 28 February 2018, https://github.com/rsmudge/Malleable-C2-Profiles/blob/master/normal/gmail.profile.

Mulugeta, Dagmawi. "Security systems and methods for detecting malleable command and control." Free Patents Online, 20 August 2024, https://www.freepatentsonline.com/12069081.html.

Rahman, Alyssa. “Cobalt Strike | Defining Cobalt Strike Components & BEACON.” Google Cloud, 10 December 2021, https://cloud.google.com/blog/topics/threat-intelligence/defining-cobalt-strike-components/.

Snort. “SID 1:25050.” Snort Rule: MALWARE-CNC Win.Trojan.Zeus variant outbound connection, Snort, https://www.snort.org/rule_docs/1-25050.

“SolarWinds Supply Chain Attack Uses SUNBURST Backdoor.” Google Cloud, 13 December 2020, https://cloud.google.com/blog/topics/threat-intelligence/evasive-attacker-leverages-solarwinds-supply-chain-compromises-with-sunburst-backdoor/.