Goals:
- reduce locking
- take advantage of 'hot' caches
- better locality
Locking reduction
New flow spare pool. The global pool is implemented as a list of blocks,
where each block has 100 spare flows. Worker threads fetch a block at
a time, storing the block in the local thread storage.
Flow Recycler now returns flows to the pool in blocks as well.
Flow Recycler fetches all flows to be processed in one step instead of
one at a time.
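A minimal sketch of the block-based spare pool, to illustrate why fetching
a block at a time reduces locking. All names (SpareBlock, SparePool,
WorkerGetSpareFlow, ...) are hypothetical and the global lock is only
described in a comment; this is not the actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SPARE_BLOCK_SIZE 100 /* flows per block, as stated above */

typedef struct Flow_ {
    struct Flow_ *next;
} Flow;

/* A block of spare flows; blocks are chained to form the global pool. */
typedef struct SpareBlock_ {
    Flow *head;
    unsigned cnt;
    struct SpareBlock_ *next;
} SpareBlock;

/* Global pool. A real implementation would protect this with a lock that
 * is taken once per block instead of once per flow. */
typedef struct SparePool_ {
    SpareBlock *blocks;
} SparePool;

/* Allocate one block and fill it with up to SPARE_BLOCK_SIZE flows. */
static SpareBlock *SpareBlockAlloc(void)
{
    SpareBlock *b = calloc(1, sizeof(*b));
    if (b == NULL)
        return NULL;
    for (unsigned i = 0; i < SPARE_BLOCK_SIZE; i++) {
        Flow *f = calloc(1, sizeof(*f));
        if (f == NULL)
            break;
        f->next = b->head;
        b->head = f;
        b->cnt++;
    }
    return b;
}

/* Hand a whole block to the global pool (recycler side). */
static void SparePoolPutBlock(SparePool *pool, SpareBlock *b)
{
    assert(b != NULL);
    b->next = pool->blocks;
    pool->blocks = b;
}

/* Take a whole block from the global pool (worker side). */
static SpareBlock *SparePoolGetBlock(SparePool *pool)
{
    SpareBlock *b = pool->blocks;
    if (b != NULL) {
        pool->blocks = b->next;
        b->next = NULL;
    }
    return b;
}

/* Pop one flow from the thread-local block; the global pool is only
 * touched when the local block is missing or empty. */
static Flow *WorkerGetSpareFlow(SparePool *pool, SpareBlock **local)
{
    if (*local == NULL || (*local)->head == NULL) {
        free(*local);
        *local = SparePoolGetBlock(pool);
        if (*local == NULL || (*local)->head == NULL)
            return NULL;
    }
    Flow *f = (*local)->head;
    (*local)->head = f->next;
    (*local)->cnt--;
    f->next = NULL;
    return f;
}
```

With this layout a worker touches the shared pool once per 100 flows; the
other 99 fetches are served from thread-local storage.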
Cache 'hot'ness
Worker threads now check the timeout of flows they evaluate during lookup.
The worker will have to read the flow into cache anyway, so the added
overhead of checking the timeout value is minimal. When a flow is considered
timed out, one of 2 things happens:
- if the flow is 'owned' by the thread it is handled locally. Handling means
checking if the flow needs 'timeout' work.
- otherwise, the flow is added to a special 'evicted' list in the flow
bucket where it will be picked up by the flow manager.
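The two outcomes above can be sketched as follows. This is an illustration
under assumptions: the struct fields, the 'owner' id and the function name
BucketCheckTimeouts are hypothetical, and bucket locking is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Flow_ {
    struct Flow_ *next;
    uint32_t lastts;  /* time of last activity, in seconds */
    uint32_t timeout; /* timeout for this flow, in seconds */
    int owner;        /* id of the worker thread owning the flow */
} Flow;

typedef struct FlowBucket_ {
    Flow *head;    /* active flows in this hash row */
    Flow *evicted; /* timed out flows left for the flow manager */
} FlowBucket;

/* Walk the bucket during lookup. Timed out flows are unlinked: flows owned
 * by the calling thread ('self') go on 'local_done' for local handling, all
 * others go on the bucket's 'evicted' list. Returns the number unlinked. */
static unsigned BucketCheckTimeouts(FlowBucket *fb, int self, uint32_t now,
                                    Flow **local_done)
{
    assert(fb != NULL && local_done != NULL);
    unsigned n = 0;
    Flow **link = &fb->head;
    while (*link != NULL) {
        Flow *f = *link;
        if ((uint64_t)f->lastts + f->timeout > now) {
            link = &f->next; /* still alive, keep it */
            continue;
        }
        *link = f->next; /* unlink from the active list */
        if (f->owner == self) {
            f->next = *local_done;
            *local_done = f;
        } else {
            f->next = fb->evicted;
            fb->evicted = f;
        }
        n++;
    }
    return n;
}
```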
Flow Manager timing
By default the flow manager now tries to do passes of the flow hash in
smaller steps, where the goal is to do a full pass in 8 x the lowest timeout
value it has to enforce. So if the lowest timeout value is 30s, a full pass
will take 4 minutes. The goal here is to reduce locking overhead and not
get in the way of the workers.
In emergency mode each pass is full, and lower timeouts are used.
Timing of the flow manager is also no longer relying on pthread condition
variables, as these generally cause waking up much quicker than the desired
timeout. Instead a simple (u)sleep loop is used.
Both changes reduce the number of hash passes a lot.
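The timing arithmetic can be sketched like this. The factor 8 and the
30s -> 4 minutes example come from the text above; the wake-up interval,
the rounding and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PASS_FACTOR 8 /* full pass takes 8 x the lowest timeout */

/* Seconds a full pass over the hash should take. */
static uint32_t FullPassSeconds(uint32_t lowest_timeout)
{
    return lowest_timeout * PASS_FACTOR;
}

/* Number of hash rows to scan per wake-up so that 'hash_size' rows are
 * covered within one full pass, when waking up every 'interval_ms'.
 * In emergency mode each pass is a full pass. */
static uint32_t RowsPerStep(uint32_t hash_size, uint32_t lowest_timeout,
                            uint32_t interval_ms, int emergency)
{
    assert(interval_ms > 0);
    if (emergency)
        return hash_size;
    uint64_t steps =
        ((uint64_t)FullPassSeconds(lowest_timeout) * 1000) / interval_ms;
    if (steps == 0)
        return hash_size;
    /* round up so the pass never takes longer than intended */
    return (uint32_t)((hash_size + steps - 1) / steps);
}
```

For a lowest timeout of 30s, FullPassSeconds() gives 240s (4 minutes). With
a hypothetical hash of 65536 rows and a wake-up every second, each step
would cover 274 rows.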
Emergency behavior
In emergency mode there are a number of changes to the workers. In this scenario
the flow memcap is fully used up and it is unavoidable that some flows won't
be tracked.
1. flow spare pool fetches are reduced to once a second. This avoids locking
overhead, while the chance of success was very low.
2. getting an active flow directly from the hash skips flows that had very
recent activity to avoid the scenario where all flows get only into the
NEW state before getting reused. Rather allow some to have a chance of
completing.
3. TCP packets that are not SYN packets will not get a used flow, unless
stream.midstream is enabled. The goal here is again to avoid evicting
active flows unnecessarily.
Better Locality
Flow Manager injects flows into the worker threads now, instead of one or
two packets. Advantage of this is that the worker threads can get packets
from their local packet pools, avoiding constant overhead of packets returning
to 'foreign' pools.
Counters
A lot of flow counters have been added and some have been renamed.
Overall the worker threads increment 'flow.wrk.*' counters, while the flow
manager increments 'flow.mgr.*'.
Additionally, none of the counters are snapshots anymore, they all increment
over time. The flow.memuse and flow.spare counters are exceptions.
Misc
FlowQueue has been split into a FlowQueuePrivate (unlocked) and FlowQueue.
Flow no longer has 'prev' pointers and uses a unified 'next' pointer for
both hash and queue use.
This commit adds MAC address output to the EVE-JSON format. We follow the
remarks made in Redmine ticket #962: for packets, log MAC src/dst as a
scalar field in EVE; for flows, log MAC src/dst as lists in EVE. Field names
are different between flow and packet context to avoid type confusion
(src_mac vs. src_macs). Configuration approach and JSON representation are
taken from previous GitHub PR #2700.
This commit checks whether pre-6.x settings for ERSPAN Type I are
present. ERSPAN Type I is no longer enabled/disabled through a
configuration setting -- it's always enabled.
When a setting exists to enable/disable ERSPAN Type I decoding, a
warning message is logged.
Enabling/disabling ERSPAN Type I decode has been deprecated in 6.x
Previously each 'TmSlot' had its own packet queue that was passed
to the registered SlotFunc as an argument. This was used mostly for
tunnel packets by the decoders and by defrag.
This patch removes that in favor of a single queue in the ThreadVars:
decode_pq. This is the non-locked version of the queue as this is
only a temporary store for handling packets within a thread.
This patch removes the PacketQueue pointer argument from the API.
The new queue can be accessed directly through the ThreadVars
pointer.
Set the livedev on reassembled packets to that of the parent
packet. Fixes issues with multidetect, specifically a segfault
as reported in issue 3380.
Bug #3380.
Replace index by strchr and rindex by strrchr.
index(3) states "POSIX.1-2008 removes the specifications of index() and
rindex(), recommending strchr(3) and strrchr(3) instead."
Add index/rindex to banned function check so they don't get reintroduced.
Bug #1443.
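A short illustration of the replacement. strchr() and strrchr() are
standard C; the helper functions BaseName and SplitKeyValue are
hypothetical examples, not code from the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the part of 'path' after the last '/', or 'path' itself. */
static const char *BaseName(const char *path)
{
    assert(path != NULL);
    const char *slash = strrchr(path, '/'); /* was: rindex(path, '/') */
    return slash != NULL ? slash + 1 : path;
}

/* Split "key=value" in place. Returns a pointer to the value, or NULL if
 * there is no '='. */
static char *SplitKeyValue(char *kv)
{
    assert(kv != NULL);
    char *eq = strchr(kv, '='); /* was: index(kv, '=') */
    if (eq == NULL)
        return NULL;
    *eq = '\0';
    return eq + 1;
}
```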
This define is used to remove reference to capture bypass in case
no capture method implementing this is active.
This patch also introduces CAPTURE_OFFLOAD_MANAGER that is defined
if we need the flow bypass manager code.
Fill in the vlan_id fields unconditionally. We can now remove the check
for the vlan.use-for-tracking setting in decode.c. The debug log message
is moved to suricata.c.
Implement port config handling. Also check both src port and dest
port for tunnels that only set the destination port to the VXLAN
port. At the point of the check we don't know the packet direction
yet.
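A sketch of the direction-agnostic port check. The function name and the
way the configured ports are passed in are assumptions; 4789 is the
IANA-assigned VXLAN UDP port.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VXLAN_DEFAULT_PORT 4789 /* IANA-assigned VXLAN UDP port */

/* Return true if either the source or the destination port matches one of
 * the configured VXLAN ports. Both are checked because the packet
 * direction is not known yet at this point. */
static bool IsVxlanPort(uint16_t sp, uint16_t dp, const uint16_t *ports,
                        size_t nports)
{
    assert(ports != NULL || nports == 0);
    for (size_t i = 0; i < nports; i++) {
        if (sp == ports[i] || dp == ports[i])
            return true;
    }
    return false;
}
```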
Implement as Suricata tunnel similar to Teredo.
Cleanups.
This patch introduces and uses a new bypass strategy
based on a callback. EBPF bypass implementation is
updated to use this new strategy.
Once the flow manager detects that a flow should be timed out,
it asks the capture method if it has seen packets in the interval.
If it is the case the lastts of the flow is updated and the timeout
is postponed.
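The callback strategy can be sketched as below. The callback signature,
the CaptureCounters stand-in and the function names are hypothetical; only
the behaviour (ask the capture method, update lastts, postpone the
timeout) is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Flow_ {
    uint32_t lastts;  /* time of last activity, in seconds */
    uint32_t timeout; /* timeout, in seconds */
    void *bypass_ctx; /* capture method specific data */
} Flow;

/* Returns true if the capture method saw packets since the last check. */
typedef bool (*BypassUpdateFunc)(Flow *f, void *ctx, uint32_t now);

/* Stand-in for a capture method keeping its own per-flow packet counter. */
typedef struct CaptureCounters_ {
    uint64_t pkts;
    uint64_t pkts_at_last_check;
} CaptureCounters;

static bool CaptureSawPackets(Flow *f, void *ctx, uint32_t now)
{
    (void)f;
    (void)now;
    CaptureCounters *c = ctx;
    bool seen = (c->pkts != c->pkts_at_last_check);
    c->pkts_at_last_check = c->pkts;
    return seen;
}

/* Flow manager side: a flow only times out if the capture method did not
 * see traffic for it either. Otherwise lastts is refreshed. */
static bool FlowTimedOut(Flow *f, uint32_t now, BypassUpdateFunc cb)
{
    assert(f != NULL);
    if ((uint64_t)f->lastts + f->timeout > now)
        return false; /* not expired yet, callback is not consulted */
    if (cb != NULL && cb(f, f->bypass_ctx, now)) {
        f->lastts = now; /* postpone the timeout */
        return false;
    }
    return true;
}
```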
For capture methods that have their own flow structure (not maintained
by Suricata), it can make sense to bypass a packet even if there is
no Flow in Suricata.
For AF_PACKET it does not make sense as the eBPF map entry will
be destroyed as soon as it is checked by the flow bypass
manager. Thus we shortcut the bypass function if no Flow is
attached to the packet.
This patch also removes the reference to Flow in the bypass functions
for AF_PACKET. It was not necessary, and removing it may be beneficial
if we ever change the bypass algorithm.
There is a synchronization issue occurring when a flow is
added to the eBPF bypass maps. The flow can have packets
in the ring buffer that have already passed the eBPF stage.
As a consequence, they are not accounted for in the eBPF counter
but are accounted for by the Suricata flow engine.
This was causing counters to be completely wrong. This code
fixes the issue by avoiding the counter change in the invalid
case.
To avoid adding four 64-bit integers to the Flow structure for the
bypass accounting, we use instead a FlowStorage. This limits the
memory usage to the size of a pointer.
In the eve log the decoder events are added as optional counters. This
behaviour is enabled by default. However, lots of the counters are
missing, as the names collide with other counters.
E.g.
decoder.ipv6 counts ipv6 packets
decoder.ipv6.unknown_next_header counts how often an unknown next
header is encountered.
In this example 'ipv6' would be both a json integer and a json object.
It appears that jansson favours the first that is generated, so the
event counters are mostly missing.
This patch registers them as 'decoder.events.<event>' instead. As
these names are generated on the fly, a hash table to contain the
allocated strings was added as well.
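A sketch of generating such a counter name on the fly. The prefix comes
from the text above; the function name and the use of plain malloc() are
assumptions, and the hash table that owns the strings is not shown.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Build "decoder.events.<event>" in newly allocated memory. The caller
 * owns the result (in the patch a hash table keeps the strings). */
static char *EventCounterName(const char *event)
{
    static const char prefix[] = "decoder.events.";
    assert(event != NULL);
    size_t len = sizeof(prefix) + strlen(event); /* sizeof counts the NUL */
    char *name = malloc(len);
    if (name == NULL)
        return NULL;
    snprintf(name, len, "%s%s", prefix, event);
    return name;
}
```

Because every event now lives below 'decoder.events', a name such as
'decoder.ipv6' stays a plain integer and no longer has to double as a JSON
object.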
Invalid Teredo can lead to valid DNS traffic (or other UDP traffic)
being misdetected as Teredo. This leads to false negatives in the
UDP payload inspection.
Make the teredo code only consider a packet teredo if the encapsulated
data was decoded without any 'invalid' events being set.
Bug #2736.
Change the decode handler signature to increase the size of its length
parameter, from uint16 to uint32. This is necessary to let suricata use
interfaces with mtu > 65535 (ex: lo interface has default size 65536).
It's necessary to change several primitives for Packet manipulation, to
unify the parameter "packet length" whenever we are before IP decoding.
Add tests before calling DecodeIPVX function to avoid a possible
integer overflow over the len parameter.
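A sketch of the kind of test meant here, assuming the IP decoder still
takes a 16-bit length while the link-layer handler now takes 32 bits. The
function names and the stub decoder are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for an IP decoder that takes a 16-bit length. It only reports
 * the length it was given. */
static int DecodeIPStub(const uint8_t *pkt, uint16_t len)
{
    (void)pkt;
    return (int)len;
}

/* Link-layer handler with a 32-bit length. The remaining length is checked
 * before it is narrowed, so a value above 65535 cannot silently wrap. */
static int DecodeLinkLayer(const uint8_t *pkt, uint32_t len, uint32_t hdr_len)
{
    assert(pkt != NULL);
    if (len < hdr_len)
        return -1; /* truncated header */
    uint32_t payload = len - hdr_len;
    if (payload > UINT16_MAX)
        return -1; /* would overflow the 16-bit len parameter */
    return DecodeIPStub(pkt + hdr_len, (uint16_t)payload);
}
```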
When switching protocol from http to tls the following corner case
was observed:
pkt 6, TC "200 connection established"
pkt 7, TS acks pkt 6 + adds "client hello"
pkt 8 TC, acks pkt 7
pkt 8 is where the detect on the 200 connection established would
normally run. However, before detection runs the app-layer is called
and it resets the state.
So the issue is missed detection on the last data in the original
protocol before the switch.
Another case was:
TS -> STARTTLS
TC -> Ack "STARTTLS data"
220
TS -> Ack "220 data"
Client Hello
In IDS mode, this made a rule that wanted to look at content:"STARTTLS"
in combination with the protocol SMTP 'alert smtp ... content:"STARTTLS";'
impossible. By the time the content would match, the protocol was already
switched.
This patch fixes this case by creating a 'Detect/Log Flush' packet in
both directions. This will force final inspection and logging of the
pre-upgrade protocol (SMTP in this example) before doing the final
switch.
Set flags by default:
-Wmissing-prototypes
-Wmissing-declarations
-Wstrict-prototypes
-Wwrite-strings
-Wcast-align
-Wbad-function-cast
-Wformat-security
-Wno-format-nonliteral
-Wmissing-format-attribute
-funsigned-char
Fix minor compiler warnings for these new flags on gcc and clang.
Due to the use of AFL_LOOP and initialization/deinit outside of it,
part of the fuzzing relied on the global 'state' in flow and defrag.
Because of this crashes that were found could not be reproduced. The
saved crash input was only the last in the series.
This patch addresses that. It requires a new output directory 'dump'
where the packet fuzzers will store all their input. If the AFL_LOOP
fails the files will not be removed and this 'serie' can be read
again for reproducing the issue.
e.g.: AFL would work with:
--afl-decoder-ppp=@@
and after a crash is found the produced serie can be read with:
--afl-decoder-ppp-serie=1486656919-514163
The series have a timestamp as name and a suffix that controls the
order in which the files will be 'replayed' in Suricata.
Call the packet bypass callback if necessary and update the flow
state. In case of failure we switch to local bypassed state and set
capture bypassed state if the callback is successful.
Add support for AFL PERSISTANT_MODE when Suricata is compiled with
a supported compiler (only afl-clang-fast for now).
This gives a ~10x performance boost when fuzzing.
We want to add counters in order to track the number of times we hit a
decode event. A decode event is related to an error in the protocol
decoding over a certain packet.
This patch first modifies the decode-event list, reordering it in order
to separate single packet events from stream-related events and adding
the prefix "decoder" to decode events.
The counters are created during the decode setup and the relative event
counter is increased every time a packet with the flag PKT_IS_INVALID is
finalized in the decode phase.
This adds a counter indicating how many times
the flow max memcap has been reached.
Since there is not always a reference to FlowManagerThreadData,
the counter is put in DecodeThreadVars.
Currently there is no counter increase in one call of FlowGetNew
because we don't have tv or dtv at the time of the call.
The following is a snippet of the generated EVE entry:
"flow":{"memcap":0,"spare":10000,"emerg_mode_entered":0,"emerg_mode_over":0,"tcp_reuse":0,"memuse":7085248}
Store the tenant id in the flow and use the stored id when setting
up pseudo packets.
For tunnel and defrag packets, get tenant from parent. This will only
pass tenant_id's set at capture time.
For defrag packets, the tenant selector based on vlan id will still
work as the vlan id(s) are stored in the defrag tracker before being
passed on.
Implement LINKTYPE_NULL for pcap live and pcap file.
From: http://www.tcpdump.org/linktypes.html
"BSD loopback encapsulation; the link layer header is a 4-byte field,
in host byte order, containing a PF_ value from socket.h for the
network-layer protocol of the packet.
Note that ``host byte order'' is the byte order of the machine on
which the packets are captured, and the PF_ values are for the OS
of the machine on which the packets are captured; if a live capture
is being done, ``host byte order'' is the byte order of the machine
capturing the packets, and the PF_ values are those of the OS of
the machine capturing the packets, but if a ``savefile'' is being
read, the byte order and PF_ values are not necessarily those of
the machine reading the capture file."
Feature ticket #1445
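A minimal sketch of reading the 4-byte header described in the quote. It
assumes a live capture on the local machine, so the host byte order and
the AF_ values from <sys/socket.h> apply; as the quote notes, this does
not hold in general for savefiles. Function and enum names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h> /* AF_INET, AF_INET6 of the capturing OS */

#define NULL_HDR_LEN 4

enum { NULL_PROTO_NONE = 0, NULL_PROTO_IPV4 = 4, NULL_PROTO_IPV6 = 6 };

/* Read the address family from the BSD loopback header and point
 * 'payload' at the network-layer data that follows it. */
static int DecodeNullHeader(const uint8_t *pkt, uint32_t len,
                            const uint8_t **payload)
{
    assert(pkt != NULL && payload != NULL);
    if (len < NULL_HDR_LEN)
        return NULL_PROTO_NONE;
    uint32_t family;
    memcpy(&family, pkt, sizeof(family)); /* host byte order, any alignment */
    *payload = pkt + NULL_HDR_LEN;
    if (family == (uint32_t)AF_INET)
        return NULL_PROTO_IPV4;
    if (family == (uint32_t)AF_INET6)
        return NULL_PROTO_IPV6;
    return NULL_PROTO_NONE;
}
```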
In flow timeout handling we need a function that allocates and blanks
a place that will be used to hold constructed packet data. This new
function has no other goal.
The field ext_pkt was cleaned before calling the release function.
The result was that IPS modes such as that of AF_PACKET were no longer
working because they could not send the data that was initially
pointed to by ext_pkt.
This patch moves the ext_pkt cleaning to the cleaning macro. This
ensures that the cleaning is done for allocated and pool packets.
Using a stack for free Packet storage causes recently freed Packets to be
reused quickly, while there is more likelihood of the data still being in
cache.
The new structure has a per-thread private stack for allocating Packets
which does not need any locking. Since Packets can be freed by any thread,
there is a second stack (return stack) for freeing packets by other threads.
The return stack is protected by a mutex. Packets are moved from the return
stack to the private stack when the private stack is empty.
Returning packets back to their "home" stack keeps the stacks from getting out
of balance.
The PacketPoolInit() function is now called by each thread that will be
allocating packets. Each thread allocates max_pending_packets, which is a
change from before, where that was the total number of packets across all
threads.
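The two-stack layout can be sketched as follows. Struct and function
names are hypothetical and packet allocation is left out; only the
private stack, the mutex-protected return stack and the "home" pool idea
are taken from the description above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct PktPool_;

typedef struct Packet_ {
    struct Packet_ *next;
    struct PktPool_ *home; /* pool of the thread that allocated the packet */
} Packet;

typedef struct PktPool_ {
    Packet *head;                /* private stack: owner thread only, no lock */
    Packet *return_head;         /* return stack: filled by other threads */
    pthread_mutex_t return_lock; /* protects return_head */
} PktPool;

static void PoolInit(PktPool *pool)
{
    pool->head = NULL;
    pool->return_head = NULL;
    pthread_mutex_init(&pool->return_lock, NULL);
}

/* Owner thread: pop from the private stack. Only when it is empty is the
 * lock taken, to move the whole return stack over in one step. */
static Packet *PoolGet(PktPool *pool)
{
    if (pool->head == NULL) {
        pthread_mutex_lock(&pool->return_lock);
        pool->head = pool->return_head;
        pool->return_head = NULL;
        pthread_mutex_unlock(&pool->return_lock);
    }
    Packet *p = pool->head;
    if (p != NULL) {
        pool->head = p->next;
        p->next = NULL;
    }
    return p;
}

/* Any thread: give the packet back to its home pool. 'self' is the pool of
 * the calling thread. */
static void PoolReturn(PktPool *self, Packet *p)
{
    assert(p != NULL && p->home != NULL);
    PktPool *home = p->home;
    if (home == self) {
        p->next = home->head; /* own packet: lock-free push */
        home->head = p;
        return;
    }
    pthread_mutex_lock(&home->return_lock);
    p->next = home->return_head;
    home->return_head = p;
    pthread_mutex_unlock(&home->return_lock);
}
```

Since the most recently freed packet sits on top of the stack, it is the
first one handed out again, which is where the cache benefit comes from.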
Extended data were freed before the release function was called.
The result was that, in AF_PACKET IPS mode, the release function
was only sending void data because the content of the extended
data is the content of the packet.
This patch updates the code to have the freeing of extended data
done in the cleaning function for a packet which is called by the
release function. This improves consistency of the code and fixes
the bug.
This patch introduces a new counter "decoder.vlan_qinq". It counts
packets that have two stacked vlan layers.
Packets with 2 vlan layers will both increment "decoder.vlan" and
"decoder.vlan_qinq".
When creating a pseudo packet with the reassembled IP packet, the
parent's vlan id or id's are also needed. The defrag packet is run
through decode and the flow engine, where the vlan id is necessary
for connecting the packet to the correct flow.
To be able to register counters from AppLayerGetCtxThread, the
ThreadVars pointer needs to be available in it and thus in its
callers:
- AppLayerGetCtxThread
- DecodeThreadVarsAlloc
- StreamTcpReassembleInitThreadCtx
app-layer.[ch], app-layer-detect-proto.[ch] and app-layer-parser.[ch].
Things addressed in this commit:
- Brings out a proper separation between protocol detection phase and the
parser phase.
- The dns app layer now is registered such that we don't use "dnstcp" and
"dnsudp" in the rules. A user who previously wrote a rule like this -
"alert dnstcp....." or
"alert dnsudp....."
would now have to use,
alert dns (ipproto:tcp;) or
alert udp (app-layer-protocol:dns;) or
alert ip (ipproto:udp; app-layer-protocol:dns;)
The same rules extend to another such protocol, dcerpc.
- The app layer parser api now takes in the ipproto while registering
callbacks.
- The app inspection/detection engine also takes an ipproto.
- All app layer parser functions now take direction as STREAM_TOSERVER or
STREAM_TOCLIENT, as opposed to 0 or 1, which was taken by some of the
functions.
- FlowInitialize() and FlowRecycle() now reset proto to 0. This is
needed by unittests, which would try to clean the flow, and that would
call the api, AppLayerParserCleanupParserState(), which would try to
clean the app state, but the app layer now needs an ipproto to figure
out which api to internally call to clean the state, and if the ipproto
is 0, it would return without trying to clean the state.
- A lot of unittests are now updated where if they are using a flow and
they need to use the app layer, we would set a flow ipproto.
- The "app-layer" section in the yaml conf has also been updated.
The uint8_t *pkt in the Packet structure always points to the memory
immediately following the Packet structure. It is better to simply
calculate that value every time than store the 8 byte pointer.
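A sketch of computing the data pointer instead of storing it. The struct
layout, the macro name PKT_DATA and the allocator are hypothetical; the
point is only that the data starts right after the Packet structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct Packet_ {
    uint32_t pktlen;
    uint32_t flags;
    /* no 'uint8_t *pkt' member: the data follows the struct in memory */
} Packet;

/* Address of the packet data: the first byte after the Packet struct. */
#define PKT_DATA(p) ((uint8_t *)((p) + 1))

/* Allocate the struct and its data area as one zeroed chunk. */
static Packet *PacketAllocWithData(uint32_t data_size)
{
    return calloc(1, sizeof(Packet) + data_size);
}
```

The computation is a single addition on a pointer that is already in a
register, which is cheaper than keeping 8 extra bytes in every Packet.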
If we have multiple layers of tunnels, the decoding of the initial
Packet will recurse in the DecodeTunnel function called in
PacketTunnelPktSetup. If we are not setting the pseudo
packet root before calling DecodeTunnel (as done in previous
code), then the tunnel root will not be correct for the lower
layer packets. This results in a counter problem and a suricata
failure after some time.
This patch adds and increments an invalid packet counter. It
does this by introducing the PacketDecodeFinalize function.
This function increments the invalid counter and also
signals the packet to CUDA.
This patch replaces PacketPseudoPktSetup by a better named
PacketTunnelPktSetup function which is also in charge of doing
the decoding of the tunneled packet.
This allows the code to be cleaned up. But it also fixes an issue.
Previously, if the DecodeTunnel function was failing (mainly because
of an invalid packet), the result was that the original packet
was considered a tunnel packet (and not inspected by payload
detection).
In some cases, the decoding is not possible and some really invalid
packet can be created. This is in particular the case of tunnel. In
that case, it is more interesting to forget about the tunneled
packet and only consider the original packet.
DecodeTunnel function is marked as warn_unused_result because it is
meaningful for the decoder to know if the underlying data were not
correct. In that case, detection only focuses on the content.
This patch fixes a compilation failure on Solaris. The compiler does
not support using a call to a function returning void as the return
expression of another function returning void.
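An illustration of the pattern and its portable form. The function names
are hypothetical. ISO C does not allow a return statement with an
expression in a function returning void, although some compilers accept
it as an extension.

```c
#include <assert.h>

static int cleaned; /* counts calls, only here to make the effect visible */

static void CleanupInner(void)
{
    cleaned++;
}

/* Rejected on Solaris according to the text above:
 *
 *     static void Cleanup(void) { return CleanupInner(); }
 *
 * Portable form: call the function, then return. */
static void Cleanup(void)
{
    CleanupInner();
    return;
}
```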
In some cases using the vlan id(s) in flow hashing is problematic. Cases
of broken routers have been reported. So this option allows for disabling
the use of vlan id(s) while calculating the flow hash, and in the future
other hashes.
Vlan tracking for flow is enabled by default.
This commit allows handling Packets allocated by different methods.
The ReleaseData function pointer in the Packet structure is replaced
with ReleasePacket function pointer, which is then always called to
release the memory associated with a Packet.
Currently, the only usage of ReleaseData is in AF Packet. Previously
ReleaseData was only called when it was not NULL. To implement the
same functionality as before in AF Packet, a new function is defined
in AF Packet to first call the AFP specific ReleaseData function and
then releases the Packet structure.
Three new general functions are defined for releasing packets in the
default case:
1) PacketFree() - To release a packet alloced with SCMalloc()
2) PacketPoolReturnPacket() - For packets allocated from the Packet Pool.
Calls RECYCLE_PACKET(p)
3) PacketFreeOrRelease() - Calls PacketFree() or PacketPoolReturnPacket()
based on the PKT_ALLOC flag.
Having these functions removes the need to check the PKT_ALLOC flag
when releasing a packet in most cases, since the ReleasePacket
function encodes how the Packet was allocated. The PKT_ALLOC flag is
still set and is needed when AF Packet releases a packet, since it
replaces the ReleasePacket function pointer with its own function and
then calls PacketFreeOrRelease(), which uses the PKT_ALLOC flag.
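A simplified sketch of the function-pointer scheme. The three generic
function names and PKT_ALLOC come from the text; the struct layout, the
flag value, the counters and CaptureReleasePacket are stand-ins, and the
pool return only counts instead of recycling.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PKT_ALLOC 0x01 /* hypothetical flag value: packet came from malloc */

typedef struct Packet_ {
    uint32_t flags;
    void (*ReleasePacket)(struct Packet_ *p);
} Packet;

static unsigned pool_returns;     /* stand-in for the packet pool */
static unsigned capture_releases; /* stand-in for capture specific work */

/* Release a packet that was allocated on the heap. */
static void PacketFree(Packet *p)
{
    free(p);
}

/* Return a packet to the pool (here: just count it). */
static void PacketPoolReturnPacket(Packet *p)
{
    (void)p;
    pool_returns++;
}

/* Pick the right generic release based on the PKT_ALLOC flag. */
static void PacketFreeOrRelease(Packet *p)
{
    if (p->flags & PKT_ALLOC)
        PacketFree(p);
    else
        PacketPoolReturnPacket(p);
}

/* Capture method specific release: do the capture specific part first,
 * then fall back to the generic release. */
static void CaptureReleasePacket(Packet *p)
{
    capture_releases++;
    PacketFreeOrRelease(p);
}
```

Callers simply invoke p->ReleasePacket(p) and do not need to know how the
packet was allocated.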
The memset() inside PACKET_INITIALIZE() is redundant in some cases and
it is cleaner to do as part of the memory allocation. This simplifies
changes for integrating Tilera mPIPE support because the size of memory
cleared in that case is different from SIZE_OF_PACKET.
For the cases where Packets are directly allocated and then call
PACKET_INITIALIZE() without memset() first, this patch adds memset() calls.
A further change would use GetPacketFromAlloc() directly.
When handling the error case of SCMalloc, SCCalloc or SCStrdup
we are in an unlikely case. This patch adds the unlikely()
expression to indicate this to gcc.
This patch has been obtained via coccinelle. The transformation
is the following:
@istested@
identifier x;
statement S1;
identifier func =~ "(SCMalloc|SCStrdup|SCCalloc)";
@@
x = func(...)
... when != x
- if (x == NULL) S1
+ if (unlikely(x == NULL)) S1
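For reference, a sketch of what the transformed code looks like. The
unlikely() definition shown here is a common one based on the GCC/Clang
builtin __builtin_expect and is an assumption about how the macro is
defined; DupName is a hypothetical example using plain malloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#else
#define unlikely(expr) (expr)
#endif

/* Duplicate a string; the allocation failure path is marked as unlikely so
 * the compiler can lay out the common path first. */
static char *DupName(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s) + 1;
    char *x = malloc(n);
    if (unlikely(x == NULL))
        return NULL;
    memcpy(x, s, n);
    return x;
}
```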
When nothing can be fetched from the pool, this can repeat frequently.
Thus displaying a message in the log will not help. This patch
uses a counter instead of a log message. As this is a sort of memcap,
this conforms to what is done for other issues of the same type.
gcc on OpenBSD does not support C99 inline functions. This patch
modifies the build system to handle this. It also changes the order
of declaration of some functions to avoid using them before
declaring them as inline.