RFC: Firewall use case

An "RFC" with a collection of use cases, feature ideas, and opportunities related to using Suricata in a firewall role.
3 months ago · c940baa6cd
parent bd7d38e91e
commit c940baa6cd
1 changed files with 108 additions and 0 deletions
--- a/rfc-suricata-firewall.md
+++ b/rfc-suricata-firewall.md
@@ -0,0 +1,108 @@
+# Firewall use case RFC
+
+Our organization is a large user of Suricata in a firewall role, which has given us a detailed view into some of the specific difficulties that arise when using an IPS for this purpose.  With the "improve firewall usecase" theme in [Suricata 8's roadmap](https://tender-impala-f95.notion.site/Suricata-8-0-0-Roadmap-39ead367c91748c5afcc8acb2753c9fc) we see an opportunity to contribute some of these experiences back, which we hope will be useful and allow Suricata to excel in this role too.
+
+## Primary themes: IDS vs firewall rulesets and expectations
+
+Firewall operators generally have rulesets and expectations that are similar to each other, but may differ from the typical IDS user's rulesets and expectations.  Since many of the more-specific improvement areas in the following sections ultimately stem from this difference, in this section I will try to outline what I see as some of the main features of how these use cases differ:
+- **Default deny** refers to a rule structure where specific, enumerated types of traffic are allowed, and all other traffic is blocked by default.  It is the accepted best-practice structure for firewall rulesets and as a result is very common among firewall users.  Today it is possible to construct Suricata rulesets that follow this pattern with the `pass`, `drop`, and `reject` actions, but certain common rule combinations remain a source of difficulty, most often when the user's intent is to allow traffic using application-layer rules and block everything else using a lower-layer `drop` or `reject`.
+- **Explicit rule intent** refers to the ability of a rule author to express their intent directly.  It is especially valued by firewall users, both to ensure security needs are met and because of the implications of default deny: when rules are evaluated in a different way or order than intended, all or parts of connections can unintentionally match catch-all default deny rules, which impacts connectivity.  This is a usability challenge for firewall rule authors.  There are several specific contributing factors, but a common theme among them is Suricata's freedom to "interpret" rules in certain ways.  The way that Suricata interprets a ruleset is influenced by factors that users may not foresee, which leads to surprises when a small rule tweak has a much larger blast radius than expected.
+- **Rule update safety** is a theme that we encounter often: since firewalls are critical-path inline devices controlling network access, users need processes for managing ruleset changes in a way that limits risk to uptime.  For firewall operators, ways to test or "preview" the effects of rule updates before they are applied to live traffic are especially valuable, because they help avoid both network downtime and unintended exposure caused by rule changes that don't behave as expected.  The problem mentioned above, of some rule updates having a larger-than-expected blast radius, is exacerbated when there isn't a safe way to preview rules first.
+- **Fail-closed** is a desirable and important property for firewalls.  The goal of a firewall is to only allow traffic that it can positively verify is compliant with its ruleset, so the possibility of any fail-open scenario is a security problem for firewall users.  Suricata features like exception policies and the midstream-policy have helped greatly here, but it remains an area that is continually at the front of our minds due to security implications.
+
+## Examples of specific suggestions
+
+Collected here, in no particular order, are some specific suggestions based on issues we often face in our firewall use case.  Many of the items are not independent of the others in the list, and all reflect some mixture of the primary themes above.  Some include possible solution ideas, but these are not necessarily the only way to solve the problems and should be read as starting points rather than concrete proposals.
+
+### Previewing rule updates
+
+To allow users to build safe processes for updating firewall rules that operate on live production traffic, some ability to preview or test the effects of rules before applying them would be very useful.  I don't think there is a single solution to this, but instead offer some perspectives and ideas that have arisen over time:
+- One best practice for introducing functional changes to a production environment is to introduce them in a "count only" form first.  For example, when introducing a new "drop" or "reject" rule, this could take the form of an ability to first insert that rule in a mode that increments counters or writes logs when the rule is matched, but does not actually perform the "drop" or "reject" action on traffic.  Logs, metric counters, or both could provide this visibility, and would allow users to adopt an update process where they first insert a new rule in a harmless count-only mode and verify that it matches the traffic they expect before switching its action on.  It is important that not applying the final action to the packet or flow is the only functional difference between the "count" version of a rule and the real one - we sometimes refer to this type of feature as a "count mode" or "shadow mode".  [Exception policy stats counters](https://redmine.openinfosecfoundation.org/issues/5816) are one example of this type of feature, allowing exception policy changes to be introduced safely, but a similar need exists for rule updates too.
+- There is a related but different issue of removing rules - firewall operators need the ability to maintain their rulesets by pruning or removing old rules that are no longer needed, but with large, complex, and old rulesets they are not always easily able to identify which rules are actually still being used.  Firewalls commonly solve this problem with a "rule hit count" feature, which is a simple per-rule counter that increments each time the rule is applied to matching traffic.  Combined with the ability to zero or reset the counters, this gives firewall administrators the tools they need to monitor and identify rules that are no longer needed.
+
+The two-step "commit-confirm" update pattern supported by some other firewall appliances is another example of a solution to a similar issue.  Committed rule changes are scheduled to revert automatically after a timeout unless a second "confirm" action is taken within that time, so any impact caused by a rule update resolves itself quickly and firewall administrators cannot lock themselves out.
+
+### Rule terminating behavior
+
+The "terminating behavior" of rules is something that is typically well-defined for firewalls but is less so in Suricata, and refers to what happens to a given packet/flow/connection *after* a rule is matched.  A rule is said to be "terminating" if it stops any further evaluation of rules, and "non-terminating" if matching that rule still allows other rules to be evaluated after.  Suricata doesn't explicitly define this but the table below is my attempt at reconstructing something similar - it's somewhat unintuitive especially because the effective terminating behavior of logging is different than that of applying rules to traffic:
+
+| Rule action | Terminates applying additional rule actions?           | Terminates logging? |
+| ----------- | ------------------------------------------------------ | ------------------- |
+| `pass`      | Yes                                                    | Yes                 |
+| `drop`      | Yes, but a later `reject` rule can still reject it too | No                  |
+| `reject`    | Yes                                                    | No                  |
+| `alert`     | No                                                     | No                  |
+
+As an example, today the following will create duplicate log entries for traffic from 1.2.3.4 -> 4.3.2.1:
+
+```
+drop tcp 1.2.3.4 any -> 4.3.2.1 any (sid:1;)
+drop tcp 1.2.3.4 any -> any any (sid:2;)
+```
+
+Although the `sid:1` rule is the one that applies and drops this traffic, rule evaluation continues and the `sid:2` rule also matches for the purposes of producing logs only.  This makes it difficult to use log outputs to understand the behavior of rules, although the [packet verdict](https://redmine.openinfosecfoundation.org/issues/5464) feature has helped.  Users are sometimes tempted to switch rule actions (e.g. from `drop` to `alert`) as a way of previewing rule updates before they are enforced (see the topic above), but this will not produce representative results because of the effects of both the terminating behavior and the action order (see the next topic).
+
+Similar to some other examples here I think some ability to explicitly define the desired terminating behavior might help.
+
+### Action order
+
+Suricata [reorders its ruleset by action](https://docs.suricata.io/en/latest/configuration/suricata-yaml.html#action-order) and other properties, which has implications for almost all of the primary themes above (especially default deny and explicit intent) and is a factor in several of the other issues here.  To help mitigate this, we maintain a patch on our own Suricata builds which introduces a configurable option to disable it and evaluate rules in the order they are written.  We would be happy to either work together to upstream our patch, or request and adopt another version of it if it were to become officially supported.
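+
+As a minimal sketch of why this matters (the addresses, port, and sids are placeholders chosen for illustration), consider a ruleset written with the broad `drop` first:
+
+```
+drop tcp any any -> any 22 (msg:"block all SSH"; sid:1000001;)
+pass tcp 10.0.0.5 any -> any 22 (msg:"allow SSH from one host"; sid:1000002;)
+```
+
+Under the documented default action order (`pass`, `drop`, `reject`, `alert`), the `pass` rule is considered before the `drop` rule even though it is written second, so a reader who assumes top-to-bottom evaluation would predict a different outcome for traffic from 10.0.0.5.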
+
+### Backwards compatibility
+
+In our deployment the owners/administrators of firewalls and the owners of the rulesets are administratively separate, which sometimes introduces a need to coordinate when version upgrades happen.  Our ideal is for the upgrade process to be as transparent as possible to the owners of the rulesets, but they sometimes do need to become involved when upgrades change rule functionality in backwards-incompatible ways.  There are two main types of backward incompatibility that we have encountered:
+
+- In some cases the syntax of existing rules becomes invalid with the upgrade, which results in a failure with an error/warning at the time the rules are loaded.  This type of backwards-incompatibility introduces an additional step to the upgrade process that we would prefer to avoid but is still more manageable than the second type:
+- The second type of backwards-incompatibility is more subtle and happens in cases where rule syntax remains valid but its meaning changes.  Existing rules continue to be valid syntactically, but the upgraded Suricata now interprets them differently, which changes which traffic matches the rules.
+
+Both of these factors impact how quickly we are able to roll out new Suricata releases, but the second one is where we see more potential for improvement today to reduce the effort required for the process of identifying subtle incompatibilities.
+
+One idea that would help in this area relates directly to the "explicit intent" theme - I think that if the rule language allowed users to express their intent more directly, it would constrain the engine's freedom to interpret rules and so reduce the room for these silent changes in meaning.
+
+Separately, any tooling, capability, or process that could identify where behavior has changed based on analysis of a ruleset would help greatly with automating the process of identifying incompatibilities.  Something like the `--engine-analysis` feature could help form a maintainable solution to this problem if its output were guaranteed to be stable across releases, only changing when Suricata's internal interpretation of the ruleset changes.
+
+### Packet vs flow actions
+
+One of the most common sources of complexity when authoring firewall rules ultimately stems from the difference between rules that apply to packets and rules that create flow state and apply to flows.  Suricata supports both types but does not expose the distinction, using a combination of multiple rule attributes to influence the action type associated with a rule.  I think this is one of the areas where some targeted work has the most potential to improve the experience of using Suricata as a firewall.  This challenge is closely related to the default-deny structure of most firewall rules, where setups often require multiple rules to match and allow different parts of connections, and any stray packets that are not explicitly allowed are subject to being blocked by a catch-all default deny rule.
+
+Resources that help firewall rule writers understand how rules are evaluated by the engine could be one way to help make this process easier.  Firewall rule writers often are able to piece together working rulesets experimentally with testing, but would be in a better position with a baseline understanding of rule processing that would let them author rules that work predictably without the trial and error process.  For example in support channels we often see a recommendation being repeated to add a `flow:to_server` option to rules as a solution to common challenges, but with limited understanding of *why* that works (which is because it turns an iponly rule into a `SIG_TYPE_PKT`).  This experience ties into themes mentioned earlier around small rule changes having unexpectedly large effects by dramatically changing the behavior of a ruleset, and difficulty in safely rolling out rule changes without a safe way to preview their effects first.
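+
+To make the `flow:to_server` example concrete, here is an illustrative pair of rules (the variables and sids are placeholders).  Based on the description above, the first would be classified as an IP-only signature while the second would become a `SIG_TYPE_PKT` signature; the exact classification may vary by Suricata version and is best confirmed with `--engine-analysis` rather than assumed:
+
+```
+pass tcp $HOME_NET any -> $EXTERNAL_NET 443 (sid:1000011;)
+pass tcp $HOME_NET any -> $EXTERNAL_NET 443 (flow:to_server; sid:1000012;)
+```
+
+The two rules differ by a single option, yet they may apply their `pass` action at a different scope, which is the kind of small change with a large effect described above.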
+
+I think one area that would help with this is documentation of the signature types, their action effects, and the factors that influence which type a given signature will have.  The table below is an incomplete, at least partially incorrect example that I have pieced together for illustration purposes.  I think a comprehensive version of something like this would go a long way toward helping rule writers:
+
+| Attributes influencing the signature type             | Signature type       | Applies to flow?        |
+| ----------------------------------------------------- | -------------------- | ----------------------- |
+| ?                                                     | SIG_TYPE_NOT_SET     | No                      |
+| Layer 3/4 protocol matching on IP & ports only        | SIG_TYPE_IPONLY      | Yes                     |
+| ?                                                     | SIG_TYPE_LIKE_IPONLY | Yes                     |
+| ?                                                     | SIG_TYPE_PDONLY      | Yes                     |
+| ?                                                     | SIG_TYPE_DEONLY      | No                      |
+| Used for most signatures that include "flow" options? | SIG_TYPE_PKT         | No                      |
+| ?                                                     | SIG_TYPE_PKT_STREAM  | Yes, if content matches |
+| Signatures that match on content?                     | SIG_TYPE_STREAM      | Yes, if content matches |
+| Application layer signatures?                         | SIG_TYPE_APPLAYER    | Yes                     |
+| ?                                                     | SIG_TYPE_APP_TX      | Yes                     |
+
+Existing features like `--engine-analysis` are valuable tools here and can output a number of useful insights including the signature type directly, but it's still difficult for users to interpret the analysis results without documentation of what the outputs mean.  I find the JSON output of the analysis more detailed and useful than the readable text, but without any reference on what the output fields mean I still need to piece together much of the puzzle myself.
+
+Another idea that I think has a lot of potential to improve the experience is related to supporting more explicit intent directly in the rule language, where rule writers could specify (for example) `pass-flow` or `pass-packet` instead of `pass`, whose interpretation can change.  These more explicit indicators of intent would place some constraints on Suricata's possible interpretations and would both make these rules easier to write and understand, and would also contribute toward solving the backward-compatibility problem mentioned above where Suricata's interpretation of a signature changes subtly between releases.  This is an area where I want to be careful not to prescribe a particular solution as there are many possibilities to explore, but in general some way to allow signature writers to express their intent would help a lot.
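+
+Purely as a hypothetical illustration of that idea, and not valid Suricata syntax today, explicit actions might read as follows (the action names come from the example above; the rest of each rule is a placeholder):
+
+```
+# Hypothetical syntax: pass-flow and pass-packet are not existing Suricata actions
+pass-flow tls $HOME_NET any -> $EXTERNAL_NET 443 (tls.sni; content:"example.com"; sid:1000021;)
+pass-packet tcp $HOME_NET any -> $EXTERNAL_NET 443 (flags:S; sid:1000022;)
+```
+
+The point of the sketch is only that the scope of the action is visible in the rule itself, rather than being inferred from the combination of options.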
+
+### Application layer default deny
+
+Default-deny rule structures specifying application-layer allow conditions are common, but they remain complex for users to get right ultimately because the application-layer details don't reveal themselves until later in the connection.  So, these default-deny configurations need to mix both application-layer and network-layer rule conditions in a way that both requires a full depth of understanding of the network protocols involved and is still sometimes challenging to maintain.  There are two basic rule structures we see being used for these, whose difference is defined by the way the default-deny action works:
+
+- One structure uses a default `drop` or `reject` action matching on connections that are in an established state.  Individual `pass` rules that match connections before this point are used to allow specific traffic, and the "deny established" catch-all rule then blocks all connections that progress to established state without being explicitly allowed.  This approach is generally easier for simpler use cases, but is not suitable for others because it can be more permissive than necessary, and it can present a challenge because some application-layer protocols become "established" at a later point in the connection than the underlying TCP transport layer does.
+- The other common structure uses a `drop` or `reject` catch-all action that blocks *all* packets statelessly, and relies on users to write rules that allow both the application layer as well as any lower-layer handshaking that is necessary for the connections to progress to a point where the application layer becomes identifiable.  This option is preferred by more security-conscious users but requires an additional depth of network protocol knowledge and results in more complex rulesets.
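+
+The two structures can be sketched roughly as follows.  These are illustrative fragments only, not tested rulesets: the domain, variables, and sids are placeholders, and whether each behaves as intended depends on the factors discussed in the sections above (action order, signature type, and Suricata version).
+
+```
+# Structure 1: pass by application-layer attribute, drop anything else that reaches established state
+pass tls $HOME_NET any -> $EXTERNAL_NET 443 (tls.sni; content:"example.com"; endswith; sid:1000031;)
+drop tcp $HOME_NET any -> $EXTERNAL_NET any (flow:established,to_server; sid:1000032;)
+```
+
+```
+# Structure 2: stateless catch-all drop, with the lower-layer handshake allowed explicitly
+pass tcp $HOME_NET any -> $EXTERNAL_NET 443 (flow:not_established,to_server; sid:1000041;)
+pass tcp $EXTERNAL_NET 443 -> $HOME_NET any (flow:not_established,to_client; sid:1000042;)
+pass tls $HOME_NET any -> $EXTERNAL_NET 443 (tls.sni; content:"example.com"; endswith; sid:1000043;)
+drop ip any any -> any any (sid:1000044;)
+```
+
+The second sketch needs two extra rules just to let the TCP handshake complete, which illustrates the additional protocol knowledge this structure demands of rule writers.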
+
+TLS is the most important of these protocols in our environment, and security policies that enforce TLS-only or allow only connections with certain TLS attributes are common.  Here again we also see opportunities where support for more explicit intent in the rule language could help rule writers to achieve the desired behavior.  With TLS we see this mostly around less common edge cases where behavior is not well-specified by the rules language:
+
+- One example involves signatures using the `ssl_state:client_hello` rule option, due to ambiguity around when the event applies in cases where the client hello is large enough to be split across more than one TCP segment.  In these cases the client hello is not an atomic event that corresponds to a single packet arrival, and the difference between "beginning of hello" and "end of hello" matters when unmatched packets will be subject to a default block action.
+- Another case relates to protocol upgrades and STARTTLS, where a default-deny decision to block needs to be made earlier than the point in the connection where it may upgrade itself to become TLS.  This is an example of a distinction that matters more when using a default-deny firewall rule structure than it does in IDS applications.
+
+I am very interested in community inputs and ideas in this area.  I think some of the more specific examples here for TLS could also be approached with a combination of documentation to reduce ambiguity and features that would allow users to better express their intent, but there is a larger general question around whether these types of rule structures are the best way to achieve a default-deny policy with Suricata, or if there are possibilities for a more "native" solution.  I think any solution that alleviates some of the cognitive load associated with lower-layer network details from the user and lets Suricata handle those details would be welcomed by firewall users.
+
+### Asymmetric routing
+
+A final topic worth mentioning is asymmetric routing - although the origins of this are in the network outside the Suricata box, it is nevertheless something that firewall operators need to remain conscious of when their intent is to allow only fully-inspected traffic or otherwise fail closed.  For the fail-closed principle it is better to completely block connections where only one half is inspected than it is to allow them to partially work anyway but with reduced effectiveness.  This eliminates the possibility of incorrectly-routed setups persisting by forcing users to fix the asymmetric routing problem because otherwise it won't work at all.
+
+The midstream-policy has allowed us to make some progress toward this by matching on asymmetrically-routed setups where the server-to-client direction is routed via a Suricata device but the client-to-server direction is not.  However, the reverse is still possible: when the client-to-server side of a connection is seen by Suricata but the return path is not, the connections will superficially appear to work, although the configured rules will not actually be enforced on them because those connections will never advance beyond "new" state.  This can result in a situation where users think they are protected but aren't, at least for short connections.  For connections lasting longer than the "new" connection timeout it introduces an additional, difficult-to-troubleshoot problem: the flow state will then be cleared, at which point connections that appeared to be working will start to match the midstream-policy.  I believe the upcoming [stream async policy](https://redmine.openinfosecfoundation.org/issues/6063) will help close this gap by allowing these connections to fail closed earlier.
+