Azure Policy improvements for log collection

Logs from Azure resources can be collected using Diagnostic Settings and sent to a Log Analytics workspace for monitoring. A Diagnostic setting on a resource defines what logs to pick up and where to forward them. If your Azure environment is small, you can configure log collection manually, but as your cloud presence grows, you will require a more scalable and standardized log collection solution. You may use Azure Policies to not only configure log collection in a scalable manner, but also to ensure that any newly created resources are appropriately monitored.

Microsoft already has policies in place for a number of resources that can be used for log collection. An Azure Policy can be used to evaluate Azure resources, check their Diagnostic settings, and if the logs are not collected as expected (non-compliant), it can configure the Diagnostic settings on the specified resource for us. Because an Azure Policy will only configure the Diagnostic settings on non-compliant resources, correctly distinguishing between compliant and non-compliant resources is critical. Unfortunately, not all Azure resources have policies yet, and even the existing ones cannot satisfy all the needs.

At BlueVoyant, we actively create policies for various purposes. This article demonstrates a few examples of how an existing policy can be improved and how one can create a multi-purpose policy that can fit any environment. The article will not go over the fundamentals of Diagnostic settings and Azure Policies.

To demonstrate how some parts of a Diagnostic settings policy might be enhanced, we use the built-in policy ‘Configure diagnostic settings for File Services to Log Analytics workspace’ (version 4.0.0).

Improvements

It is critical to appropriately assess whether or not a resource is compliant since only non-compliant ones can be automatically configured by a policy. In our scenario, non-compliance means that the Diagnostic settings are not configured as expected and the required logs are not collected from a resource as intended. A policy will configure only non-compliant resources to forward logs to Sentinel.

We can forward the following events and metrics in the case of an Azure Storage File Service resource:

  • Logs – StorageRead
  • Logs – StorageWrite
  • Logs – StorageDelete
  • Metrics – Transaction (I will ignore metrics in this post)

This is how the original policy determines whether the resource is compliant or not. This is only a small part of the original policy, but all the changes regarding the policy behavior will be made here.

"type": "Microsoft.Insights/diagnosticSettings",
"name": "[parameters('profileName')]",
"existenceCondition": {
    "allOf": [
        {
            "field": "Microsoft.Insights/diagnosticSettings/logs.enabled",
            "equals": "[parameters('logsEnabled')]"
        },
        {
            "field": "Microsoft.Insights/diagnosticSettings/metrics.enabled",
            "equals": "[parameters('metricsEnabled')]"
        },
        {
            "field": "Microsoft.Insights/diagnosticSettings/workspaceId",
            "equals": "[parameters('logAnalytics')]"
        }
    ]
}

We are going to discuss three things here:

  1. The ‘name’ parameter (first highlight)
  2. The ‘logs.enabled’ filter (second highlight)
  3. Modifying the logic to create a multi-purpose policy

The name parameter

The ‘name’ argument can be found in the code in the following line:

"type": "Microsoft.Insights/diagnosticSettings",
"name": "[parameters('profileName')]",
"existenceCondition": {
    ...
}

The name parameter indicates that a resource will be compliant only if it has a Diagnostic setting with the exact same name configured on it. Having the same configuration but with a different name will result in a non-compliant resource. While this may be sufficient in some cases, it severely limits the policy’s applicability.

  • What if a resource is already properly configured but the Diagnostic setting has a different name? In this case, since the name is different, the resource will be marked as non-compliant. It also cannot be fixed automatically, so the resource will stay non-compliant.

  • It’s possible that someone else has already configured this resource before you. Assume you are an MSSP analyzing a client’s log collection process. A name mismatch is very likely to appear in these circumstances. Even if everything is collected correctly, the resource will be reported as non-compliant because of the different name, which makes your log collection evaluation inaccurate.

Usually, having the proper configuration under a different name is acceptable and should be tagged as compliant, so the recommendation is not to have the ‘name’ filter in the code. For my example policy, I therefore removed this parameter.

Using the ‘name’ parameter can be useful in specific scenarios and its removal can introduce some other problems. However, by improving our code we also resolve the issues created by this step.
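The effect of the ‘name’ filter can be illustrated with a small Python model. This is only a sketch of the behavior described above, not how Azure Policy is implemented: the DiagnosticSetting class, the is_compliant function, the setting names, and the workspace ID are all hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagnosticSetting:
    """Simplified, hypothetical stand-in for a Diagnostic setting."""
    name: str
    workspace_id: str
    logs_enabled: bool


def is_compliant(settings, wanted, match_name):
    """Return True if any setting on the resource satisfies the policy.

    With match_name=True, settings with a different name are skipped,
    which mirrors the effect of the 'name' filter discussed above.
    """
    for setting in settings:
        if match_name and setting.name != wanted.name:
            continue
        if (setting.workspace_id == wanted.workspace_id
                and setting.logs_enabled == wanted.logs_enabled):
            return True
    return False


# What the policy expects (all values are made up for the example).
wanted = DiagnosticSetting("setByPolicy", "/workspaces/sentinel", True)

# The resource is configured correctly, but under a different name.
existing = [DiagnosticSetting("configured-by-client", "/workspaces/sentinel", True)]

with_name = is_compliant(existing, wanted, match_name=True)
without_name = is_compliant(existing, wanted, match_name=False)
print(with_name, without_name)
```

In this model the same, correctly configured resource is non-compliant when the name is part of the check and compliant when it is not.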

The logs.enabled filter – more granular filtering

Many Azure policies, including those available on the internet, are all-or-nothing in nature. You can gather all the logs or none of them. Furthermore, the evaluation of a resource’s compliance typically occurs with this all-or-nothing behavior in mind. In this case, either all of the StorageRead, StorageWrite, and StorageDelete logs are collected or none of them.

Instead of all-or-nothing, the code can be changed to provide more granular evaluation and logging options. With a little modification you could choose to collect only StorageRead events but not StorageWrite and StorageDelete ones. In this case, it is also important to properly evaluate a resource and only mark the relevant ones as non-compliant.

This is the code that evaluates the log (event) collection of a resource in the Microsoft policy:

"existenceCondition": {
    "allOf": [
        {
            "field": "Microsoft.Insights/diagnosticSettings/logs.enabled",
            "equals": "[parameters('logsEnabled')]"
        },
        ...
    ]
}

The ‘Microsoft.Insights/diagnosticSettings/logs.enabled’ field evaluates to ‘true’ if at least one element in the Logs category is enabled. So, if StorageRead is enabled, this check results in ‘true’ regardless of whether StorageWrite or StorageDelete is enabled. Only one of them being enabled is enough for a ‘true’ outcome.

Assume you wish to enable StorageRead and StorageWrite on your resource, but only StorageRead is enabled right now. In this situation, the policy will consider the resource compliant because of how logs.enabled works: it evaluates the resource as compliant if at least one log type is enabled for collection. This is a limitation of the current code. Moreover, since the resource will be tagged as compliant, you will not even be aware that the desired logs are not being gathered.
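The limitation can be reproduced with a few lines of Python. This is a rough model of the single logs.enabled check as it is described in this article, not Azure's actual evaluation engine; the function name and the dictionary layout are assumptions made for the illustration.

```python
def original_logs_check(resource_logs, logs_enabled_param):
    """Model of the single logs.enabled comparison described above.

    The field is treated as True when at least one log category is
    enabled, and that one value is compared with the policy parameter.
    """
    any_enabled = any(resource_logs.values())
    return any_enabled == logs_enabled_param


# We want StorageRead and StorageWrite, but only StorageRead is enabled.
only_read = {"StorageRead": True, "StorageWrite": False, "StorageDelete": False}
nothing = {"StorageRead": False, "StorageWrite": False, "StorageDelete": False}

only_read_result = original_logs_check(only_read, True)
nothing_result = original_logs_check(nothing, True)
print(only_read_result, nothing_result)
```

In this model the resource with only StorageRead enabled passes the check, so the missing StorageWrite logs go unnoticed. Only a resource with no log category enabled at all fails it.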

To improve the policy, we must replace the single ‘logsEnabled’ parameter with a separate parameter for each log type (three in this case). The ‘logsEnabled’ parameter is used in the original code to define whether you want to collect all the logs or none of them. After this change, we have to implement a distinct check for each log type. Rather than a single check that returns true when at least one log type is enabled, the new code checks each log type (read, write, delete) individually and compares its value with what we configured in the policy. We can accomplish this evaluation in the following manner:

{ 
  "count": { 
    "value": [ 
      { 
        "category": "StorageRead", 
        "enabled": "[parameters('StorageRead')]" 
      }, 
      { 
        "category": "StorageWrite", 
        "enabled": "[parameters('StorageWrite')]" 
      }, 
      { 
        "category": "StorageDelete", 
        "enabled": "[parameters('StorageDelete')]" 
      } 
    ], 
    "name": "logTypes", 
    "where": { 
      "count": { 
        "field": "Microsoft.Insights/diagnosticSettings/logs[*]", 
        "where": { 
          "allOf": [ 
            { 
              "field": "Microsoft.Insights/diagnosticSettings/logs[*].enabled", 
              "equals": "[current('logTypes').enabled]" 
            }, 
            { 
              "field": "Microsoft.Insights/diagnosticSettings/logs[*].category", 
              "equals": "[current('logTypes').category]" 
            } 
          ] 
        } 
      }, 
      "greater": 0 
    } 
  }, 
  "equals": 3 
}

Instead of verifying whether any of the event types are enabled, we check each one individually. If all of them have the value we specified on the parameters page (in the policy), we will obtain three ‘true’ results (in this example, due to the three log types in a File Service resource).

So, if we only wish to send StorageRead logs and both StorageRead and StorageWrite are enabled, the resource is now classified as non-compliant, whereas it was previously marked as compliant. Collecting more or fewer log categories than those specified in the policy will result in a non-compliant outcome.

The table displays what log collection settings are specified on the resource, what we configured in the policy (what we want), and whether the particular log type is compliant or non-compliant based on these settings. The last entry in the ‘Compliant?’ field is ‘True’ if StorageRead, StorageWrite, and StorageDelete are all compliant; otherwise, it is non-compliant. This final entry indicates whether or not the entire resource will be tagged as compliant.

Log type         Resource Diagnostic settings   Policy configuration   Compliant?
StorageRead      True                           True                   True
StorageWrite     True                           False                  False
StorageDelete    False                          False                  True
Entire resource                                                        False
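The nested count expression and the table above can be mirrored in a short Python sketch. It is a simplified model under the assumption that the resource lists every log category explicitly; the function names are hypothetical and the code does not call any Azure API.

```python
def count_matching(resource_logs, wanted):
    """Outer count: how many wanted entries have an inner count above 0.

    The inner count is the number of log entries on the resource whose
    category and enabled flag both equal the current wanted entry.
    """
    matched = 0
    for category, enabled in wanted.items():
        inner = sum(
            1
            for entry in resource_logs
            if entry["category"] == category and entry["enabled"] == enabled
        )
        if inner > 0:
            matched += 1
    return matched


def is_compliant_exact(resource_logs, wanted):
    """Compliant only when every wanted entry matched ("equals": 3 here)."""
    return count_matching(resource_logs, wanted) == len(wanted)


# Values taken from the table above.
table_resource = [
    {"category": "StorageRead", "enabled": True},
    {"category": "StorageWrite", "enabled": True},
    {"category": "StorageDelete", "enabled": False},
]
table_policy = {"StorageRead": True, "StorageWrite": False, "StorageDelete": False}

matched = count_matching(table_resource, table_policy)
compliant = is_compliant_exact(table_resource, table_policy)
print(matched, compliant)
```

With the table's values, only StorageRead and StorageDelete match, so the count is 2 rather than 3 and the resource is non-compliant.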

Create a multi-purpose policy

Using the modified code, if we want StorageRead logs only and both StorageRead and StorageWrite are enabled the resource will be tagged as non-compliant, because we do not want StorageWrite logs.

This is the expected behavior, but in some situations, we want something different. Typically, there are two use cases for this type of policy when we want to onboard logs to Sentinel:

  1. Onboarding with a cost optimization mindset: When we do this, we want to ensure that only the necessary logs are forwarded. The policy above can already do this. If we want only Read logs but Write logs are forwarded as well, the resource is going to be marked as non-compliant.

  2. Onboarding from a coverage perspective: We frequently work with clients who already have a (semi-)configured Sentinel instance in place and logs are already forwarded. They want to ensure that all of the logs we want for security are present, but they are comfortable with having extra logs or expressly want to forward other event types as well. So, even if we just want Read logs and both Read and Write logs are enabled on the resource, we still want to label the resource as compliant, because the logs we need are collected and having more logs is fine in this scenario. In this case, we want to check the coverage but do not want to remove the unnecessary logs.

To fulfill both eventualities with a single policy, we may add an additional filter to the policy’s ‘existenceCondition’ section. Let’s call this new parameter EvaluationMethod. EvaluationMethod can be set to one of two values: ‘Coverage’ or ‘Cost Optimization’.

If ‘Cost Optimization’ is selected, the policy behaves as described in the previous section: we only label as compliant the resources that are transmitting exactly the logs that we configured in the policy. In ‘Coverage’ mode, the policy disregards the log types that are set to ‘False’, so ‘False’ means we don’t care whether the given logs are collected or not. In other words, a resource is compliant if it collects at least the logs we want, even if other log types are collected on top of our requirement.

So, we go through all the log types during evaluation. The policy evaluates the resources based on this logic:

  • Return ‘True’ if the log collection is configured in the same way as our policy specifies for the given log type. For example, we want StorageRead and it is collected, or we don’t want StorageWrite and it is not collected.
  • Return ‘True’ if the evaluation method is ‘Coverage’ (which means more logs are fine) and the policy for the log type is ‘False’ (which means we don’t care whether it is collected or not).
  • Return ‘False’ otherwise.

Here is an example with the previous settings showing both modes:

Log type         Resource Diagnostic settings   Policy configuration   Cost Optimization - Compliant?   Coverage - Compliant?
StorageRead      True                           True                   True                             True
StorageWrite     True                           False                  False                            True
StorageDelete    False                          False                  True                             True
Entire resource                                                        False                            True
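The three evaluation rules and the table above can be checked with a minimal Python sketch. The function names and the dictionary layout are assumptions made for this illustration. Only the mode names ‘Coverage’ and ‘Cost Optimization’ come from the article.

```python
def category_ok(actual, wanted, method):
    """Apply the three evaluation rules to a single log type."""
    # Rule 1: the resource is configured exactly as the policy specifies.
    if actual == wanted:
        return True
    # Rule 2: in Coverage mode, a False in the policy means "don't care".
    if method == "Coverage" and wanted is False:
        return True
    # Rule 3: everything else is non-compliant.
    return False


def resource_compliant(resource, policy, method):
    """A resource is compliant only if every log type is compliant."""
    return all(category_ok(resource[c], policy[c], method) for c in policy)


# Values taken from the table above.
resource = {"StorageRead": True, "StorageWrite": True, "StorageDelete": False}
policy = {"StorageRead": True, "StorageWrite": False, "StorageDelete": False}

cost_result = resource_compliant(resource, policy, "Cost Optimization")
coverage_result = resource_compliant(resource, policy, "Coverage")
print(cost_result, coverage_result)
```

Note that ‘Coverage’ only relaxes the log types set to ‘False’ in the policy. A log type that we want but that is not collected still makes the resource non-compliant in both modes.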

Here is the modification of the evaluation logic (showing only the relevant parts):

"allOf": [
  {
    "anyOf": [
      {
        "field": "Microsoft.Insights/diagnosticSettings/logs[*].enabled",
        "equals": "[current('logTypes').enabled]"
      },
      {
        "allOf": [
          {
            "value": "[parameters('EvaluationMethod')]",
            "equals": "Coverage"
          },
          {
            "value": "[current('logTypes').enabled]",
            "equals": false
          }
        ]
      }
    ]
  },
  {
    "field": "Microsoft.Insights/diagnosticSettings/logs[*].category",
    "equals": "[current('logTypes').category]"
  }
]

Please keep in mind that you will need to make some further changes to the code in order for it to work, such as adding new parameters and utilizing those parameters. The code above merely demonstrates the changes to the logic. However, you can see from the code comparison that I not only changed the evaluation part of the code (removing the name parameter and adding some evaluation logic), but I also made a lot of smaller modifications to be able to leverage the new arguments and features.

The code is simply used to show some of the possibilities that a policy can offer. The created policy is not being used by BlueVoyant in this form.

Find the policy at this link, or check the comparison of the original and the modified code below: Policy comparison