Goals:
- reduce locking
- take advantage of 'hot' caches
- better locality
Locking reduction
New flow spare pool. The global pool is implemented as a list of blocks,
where each block holds 100 spare flows. Worker threads fetch a block at
a time, storing the block in thread-local storage.
The Flow Recycler now returns flows to the pool in blocks as well.
Flow Recycler fetches all flows to be processed in one step instead of
one at a time.
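As an illustration, a block-based spare pool could look roughly like the
sketch below. The structure and function names, and the use of a plain
mutex, are illustrative assumptions rather than the actual Suricata
definitions; only the "one lock per block of 100 flows" idea is taken from
the description above.

    /* sketch of a block-based spare flow pool; names are illustrative */
    #include <pthread.h>
    #include <stddef.h>

    #define SPARE_POOL_BLOCK_SIZE 100        /* flows per block */

    typedef struct Flow_ Flow;               /* opaque here */

    typedef struct FlowSparePoolBlock_ {
        struct FlowSparePoolBlock_ *next;
        Flow *flows[SPARE_POOL_BLOCK_SIZE];
        int cnt;
    } FlowSparePoolBlock;

    typedef struct FlowSparePool_ {
        FlowSparePoolBlock *head;            /* global list of blocks */
        pthread_mutex_t m;                   /* taken once per block */
    } FlowSparePool;

    /* worker: grab a whole block under one lock and keep it in
     * thread-local storage, so per-flow gets are lock free */
    static FlowSparePoolBlock *FlowSpareGetBlock(FlowSparePool *p)
    {
        pthread_mutex_lock(&p->m);
        FlowSparePoolBlock *b = p->head;
        if (b != NULL)
            p->head = b->next;
        pthread_mutex_unlock(&p->m);
        return b;
    }

    /* recycler: return a full block under one lock as well */
    static void FlowSpareReturnBlock(FlowSparePool *p, FlowSparePoolBlock *b)
    {
        pthread_mutex_lock(&p->m);
        b->next = p->head;
        p->head = b;
        pthread_mutex_unlock(&p->m);
    }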
Cache 'hot'ness
Worker threads now check the timeout of flows they evaluate during lookup.
The worker will have to read the flow into cache anyway, so the added
overhead of checking the timeout value is minimal. When a flow is considered
timed out, one of 2 things happens:
- if the flow is 'owned' by the thread it is handled locally. Handling means
checking if the flow needs 'timeout' work.
- otherwise, the flow is added to a special 'evicted' list in the flow
bucket where it will be picked up by the flow manager.
Flow Manager timing
By default the flow manager now tries to do passes of the flow hash in
smaller steps, where the goal is to do a full pass in 8 x the lowest timeout
value it has to enforce. So if the lowest timeout value is 30s, a full pass
will take 4 minutes. The goal here is to reduce locking overhead and not
get in the way of the workers.
In emergency mode each pass is full, and lower timeouts are used.
Timing of the flow manager also no longer relies on pthread condition
variables, as these generally cause waking up much quicker than the desired
timeout. Instead a simple (u)sleep loop is used.
Both changes reduce the number of hash passes a lot.
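As a rough illustration of the timing math only (the function name and the
wakeup-based slicing below are assumptions, not the actual flow manager
code): with the lowest enforced timeout T, the hash is walked in slices
sized so that a complete pass takes about 8 * T seconds.

    /* sketch: how many hash rows to scan per wakeup of the sleep loop */
    #include <stdint.h>

    static uint32_t FlowManagerRowsPerWakeup(uint32_t hash_rows,
                                             uint32_t lowest_timeout_s,
                                             uint32_t wakeups_per_s)
    {
        /* target: one full pass every 8 x the lowest timeout,
         * e.g. 8 * 30s = 240s = 4 minutes */
        uint32_t pass_s = 8 * lowest_timeout_s;
        uint32_t wakeups_per_pass = pass_s * wakeups_per_s;
        if (wakeups_per_pass == 0)
            wakeups_per_pass = 1;
        uint32_t rows = hash_rows / wakeups_per_pass;
        return rows ? rows : 1;
    }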
Emergency behavior
In emergency mode there are a number of changes to the workers. In this scenario
the flow memcap is fully used up and it is unavoidable that some flows won't
be tracked.
1. flow spare pool fetches are reduced to once a second. This avoids locking
overhead in a situation where the chance of success is very low anyway.
2. getting an active flow directly from the hash skips flows that had very
recent activity to avoid the scenario where all flows get only into the
NEW state before getting reused. Rather allow some to have a chance of
completing.
3. TCP packets that are not SYN packets will not get a used flow, unless
stream.midstream is enabled. The goal here is again to avoid evicting
active flows unnecessarily.
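A minimal sketch of the rule in item 3; the function and flag names are
hypothetical and only capture the decision described, not the actual flow
engine code.

    /* sketch: may this packet force-reuse ('steal') a used flow? */
    #include <stdbool.h>

    static bool MayStealUsedFlow(bool emergency, bool is_tcp, bool is_syn,
                                 bool midstream_enabled)
    {
        if (!emergency)
            return true;
        /* don't evict an active flow for a TCP packet that can't start
         * a session anyway, unless midstream pickup is enabled */
        if (is_tcp && !is_syn && !midstream_enabled)
            return false;
        return true;
    }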
Better Locality
The Flow Manager now injects flows into the worker threads, instead of one or
two packets. The advantage of this is that the worker threads can get packets
from their local packet pools, avoiding the constant overhead of packets
returning to 'foreign' pools.
Counters
A lot of flow counters have been added and some have been renamed.
Overall the worker threads increment 'flow.wrk.*' counters, while the flow
manager increments 'flow.mgr.*'.
Additionally, none of the counters are snapshots anymore; they all increment
over time. The flow.memuse and flow.spare counters are the exceptions.
Misc
FlowQueue has been split into a FlowQueuePrivate (unlocked) and FlowQueue.
Flow no longer has 'prev' pointers and uses a unified 'next' pointer for
both hash and queue use.
This commit adds MAC address output to the EVE-JSON format. We follow the
remarks made in Redmine ticket #962: for packets, log MAC src/dst as a
scalar field in EVE; for flows, log MAC src/dst as lists in EVE. Field names
are different between flow and packet context to avoid type confusion
(src_mac vs. src_macs). The configuration approach and JSON representation are
taken from previous GitHub PR #2700.
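A hedged sketch of the resulting JSON shape using jansson: packet events get
scalar fields, flow events get list fields under a different name. The helper
functions and the dest-side field name are assumptions; only the
src_mac/src_macs distinction comes from the text above.

    /* sketch: scalar MAC fields for packets, list fields for flows */
    #include <jansson.h>
    #include <stddef.h>

    static void EveAddPacketMacs(json_t *js, const char *src, const char *dst)
    {
        /* packet context: one MAC pair per event -> scalar fields */
        json_object_set_new(js, "src_mac", json_string(src));
        json_object_set_new(js, "dest_mac", json_string(dst)); /* assumed name */
    }

    static void EveAddFlowSrcMacs(json_t *js, const char **srcs, size_t nsrc)
    {
        /* flow context: several MACs may be seen -> list field, named
         * differently (src_macs) to avoid type confusion */
        json_t *arr = json_array();
        for (size_t i = 0; i < nsrc; i++)
            json_array_append_new(arr, json_string(srcs[i]));
        json_object_set_new(js, "src_macs", arr);
    }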
Previously each 'TmSlot' had its own packet queue that was passed
to the registered SlotFunc as an argument. This was used mostly for
tunnel packets by the decoders and by defrag.
This patch removes that in favor of a single queue in the ThreadVars:
decode_pq. This is the non-locked version of the queue as this is
only a temporary store for handling packets within a thread.
This patch removes the PacketQueue pointer argument from the API.
The new queue can be accessed directly through the ThreadVars
pointer.
Fill in the vlan_id fields unconditionally. We can now remove the check
for the vlan.use-for-tracking setting in decode.c. The debug log message
is moved to suricata.c.
Since the vlan.use-for-tracking setting is now handled in flow-hash.c,
we can fill in the vlan_id fields unconditionally. This makes the 'vlanh'
fields unnecessary.
Related to https://redmine.openinfosecfoundation.org/issues/3076
Implement port config handling. Also check both src port and dest
port for tunnels that only set the destination port to the VXLAN
port. At the point of the check we don't know the packet direction
yet.
Implement as Suricata tunnel similar to Teredo.
Cleanups.
No longer set stream events after a gap or wrong thread. We know
we lost sync and are now in 'let's make the best of it' mode. No
point in flooding the system with stream events.
Ticket #2484
In the eve log the decoder events are added as optional counters. This
behaviour is enabled by default. However, lots of the counters are
missing, as their names collide with other counters.
E.g.
decoder.ipv6 counts ipv6 packets
decoder.ipv6.unknown_next_header counts how often an unknown next
header is encountered.
In this example 'ipv6' would be both a json integer and a json object.
It appears that jansson favours the first that is generated, so the
event counters are mostly missing.
This patch registers them as 'decoder.events.<event>' instead. As
these names are generated on the fly, a hash table to contain the
allocated strings was added as well.
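A small sketch of the name building described above; the registration side
and the string ownership (the hash table of allocated names) are left out,
and the helper is hypothetical.

    /* sketch: 'decoder.ipv6.unknown_next_header' style names become
     * 'decoder.events.ipv6.unknown_next_header' so they can no longer
     * collide with the plain packet counters */
    #include <stdio.h>

    static void BuildEventCounterName(char *out, size_t outlen,
                                      const char *event_name)
    {
        snprintf(out, outlen, "decoder.events.%s", event_name);
    }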
Invalid Teredo can lead to valid DNS traffic (or other UDP traffic)
being misdetected as Teredo. This leads to false negatives in the
UDP payload inspection.
Make the teredo code only consider a packet teredo if the encapsulated
data was decoded without any 'invalid' events being set.
Bug #2736.
Change the decode handler signature to increase the size of its length
parameter, from uint16_t to uint32_t. This is necessary to let suricata use
interfaces with an mtu > 65535 (e.g. the lo interface has a default size of
65536).
It's necessary to change several primitives for Packet manipulation, to
unify the "packet length" parameter wherever we are before IP decoding.
Add checks before calling the DecodeIPVX functions to avoid a possible
integer overflow of the len parameter.
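A sketch of the kind of guard meant by the last sentence; the limit, the
wrapper and the decoder prototype are illustrative assumptions, not the real
decoder API.

    /* sketch: validate the now 32-bit length before handing it to an
     * IP decoder that still works with 16-bit lengths internally */
    #include <stdint.h>

    #define IP_DECODER_MAX_LEN 65535u          /* assumed decoder limit */

    int DecodeIPv4Sketch(const uint8_t *pkt, uint16_t len);   /* hypothetical */

    int DecodeRawSketch(const uint8_t *pkt, uint32_t len)
    {
        /* 'len' is 32 bit so interfaces with an MTU > 65535 (e.g. lo at
         * 65536) can be handled, but check it before narrowing */
        if (pkt == NULL || len == 0 || len > IP_DECODER_MAX_LEN)
            return -1;
        return DecodeIPv4Sketch(pkt, (uint16_t)len);
    }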
Enables IPS functionality on Windows using the open-source
(LGPLv3/GPLv2) WinDivert driver and API.
From https://www.reqrypt.org/windivert-doc.html : "WinDivert is a
user-mode capture/sniffing/modification/blocking/re-injection package
for Windows Vista, Windows Server 2008, Windows 7, and Windows 8.
WinDivert can be used to implement user-mode packet filters, packet
sniffers, firewalls, NAT, VPNs, tunneling applications, etc., without
the need to write kernel-mode code."
- adds `--windivert [filter string]` and `--windivert-forward [filter
string]` command-line options to enable WinDivert IPS mode.
`--windivert[-forward] true` will open a filter for all traffic. See
https://www.reqrypt.org/windivert-doc.html#filter_language for more
information.
Limitation: currently limited to `autofp` runmode.
Additionally:
- `tmm_modules` now zeroed during `RegisterAllModules`
- fixed Windows Vista+ `inet_ntop` call in `PrintInet`
- fixed `GetRandom` bug (nonexistent keys) on fresh Windows installs
- fixed `RandomGetClock` building on Windows builds
- Added WMI queries for MTU
pfring.h brings a different version of likely/unlikely that gives
warnings. So make sure we include our own before.
Make sure pfring.h isn't included globally due to apparent redefinition
of pthread_rwlock_t.
This patch adds support for hw bypass by enabling flow offload in the network
card (when supported) and implementing the BypassPacketsFlow callback.
Hw bypass support is disabled by default, and can be enabled by setting
"bypass: yes" in the pfring interface configuration section in suricata.yaml.
Added the util-napatech module, which contains the implementation of the
threads for processing statistics, and modified source-napatech and
runmode-napatech to instantiate the threads.
Added an include declaration and a napatech-specific structure when
HAVE_NAPATECH is defined.
Added util-napatech module to project.
On OpenBSD 6.0 and 6.1 the following pcap gets a datalink type of
101 instead of our defined DLT_RAW.
File type: Wireshark/tcpdump/... - pcap
File encapsulation: Raw IP
File timestamp precision: microseconds (6)
Packet size limit: file hdr: 262144 bytes
Number of packets: 23
File size: 11 kB
Data size: 11 kB
Capture duration: 7,424945 seconds
First packet time: 2017-05-25 21:59:31,957953
Last packet time: 2017-05-25 21:59:39,382898
Data byte rate: 1536 bytes/s
Data bit rate: 12 kbps
Average packet size: 496,00 bytes
Average packet rate: 3 packets/s
SHA1: 120cff9878b93ac74b68fb9216027bef3b3c018f
RIPEMD160: 35fa287bf30d8be8b8654abfe26e8d3883262e8e
MD5: 13fe4bc50fe09bdd38f07739bd1ff0f0
Strict time order: True
Number of interfaces in file: 1
Interface #0 info:
Encapsulation = Raw IP (7/101 - rawip)
Capture length = 262144
Time precision = microseconds (6)
Time ticks per second = 1000000
Number of stat entries = 0
Number of packets = 23
On Linux it is 12.
On the tcpdump/libpcap site the DLT_RAW is defined as 101:
http://www.tcpdump.org/linktypes.html
Oddly, the DLT_RAW macro on OpenBSD is still defined as 14, as expected.
So for some reason, libpcap on OpenBSD uses 101, which does match
the tcpdump/libpcap documentation.
So this patch adds support for datalink 101 as RAW.
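Conceptually the change boils down to accepting 101 next to the existing raw
values; the sketch below uses made-up macro names to avoid clashing with the
system DLT_RAW definition, and the decoder prototype is hypothetical.

    /* sketch: treat datalink 101 the same as the classic raw-IP values */
    #include <stdint.h>

    #define RAW_DLT_BSD    14     /* DLT_RAW on OpenBSD and other BSDs */
    #define RAW_DLT_LINUX  12     /* DLT_RAW on Linux */
    #define LINKTYPE_RAW_  101    /* per http://www.tcpdump.org/linktypes.html */

    int DecodeRawSketch(const uint8_t *pkt, uint32_t len);    /* hypothetical */

    int DecodeByDatalink(int datalink, const uint8_t *pkt, uint32_t len)
    {
        switch (datalink) {
            case RAW_DLT_BSD:
            case RAW_DLT_LINUX:
            case LINKTYPE_RAW_:   /* what OpenBSD's libpcap reports */
                return DecodeRawSketch(pkt, len);
            default:
                return -1;        /* not a raw-IP datalink */
        }
    }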
Observed:
STARTTLS creates 2 pseudo packets which are tied to a real packet.
TPR (tunnel packet ref) counter increased to 2.
Pseudo 1: goes through 'verdict', increments 'ready to verdict' to 1.
Packet pool return code frees this packet and decrements TPR in root
to 1. RTV counter not changed. So both are now 1.
Pseudo 2: verdict code sees RTV == TPR, so verdict is set based on
pseudo packet. This is too soon. Packet pool return code frees this
packet and decrements TPR in root to 0.
Real packet: TPR is 0 so set verdict on this packet. As verdict was
already set, NFQ reports an issue.
The decrementing of TPR doesn't seem to make sense as RTV is not
updated.
Solution:
This patch refactors the ref count and verdict count logic. The beef
is now handled in the generic function TmqhOutputPacketpool(). NFQ
and IPFW call a utility function VerdictTunnelPacket to see if they
need to verdict a packet.
Remove some unused macros for managing these counters.
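The decision VerdictTunnelPacket has to make can be sketched as follows; the
field names, locking and counter handling here are illustrative, not the
actual Suricata code. The only point is that exactly one packet of the tunnel
'family' (the last outstanding one) should trigger the verdict on the root.

    /* sketch: should this (pseudo or real) tunnel packet set the verdict? */
    #include <stdbool.h>
    #include <pthread.h>

    typedef struct PacketSketch_ {
        struct PacketSketch_ *root;   /* real packet for pseudo packets */
        pthread_mutex_t tunnel_lock;
        int outstanding;              /* assumed: 1 for the root plus one
                                       * per pseudo packet still alive */
    } PacketSketch;

    static bool VerdictTunnelPacketSketch(PacketSketch *p)
    {
        PacketSketch *root = p->root ? p->root : p;
        pthread_mutex_lock(&root->tunnel_lock);
        bool last = (--root->outstanding == 0);
        pthread_mutex_unlock(&root->tunnel_lock);
        /* only the last packet of the tunnel family sets the verdict */
        return last;
    }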
When switching protocol from http to tls the following corner case
was observed:
pkt 6, TC "200 connection established"
pkt 7, TS acks pkt 6 + adds "client hello"
pkt 8 TC, acks pkt 7
pkt 8 is where the detection on the "200 connection established" data
would normally run. However, before detection runs, the app-layer is
called and it resets the state.
So the issue is missed detection on the last data in the original
protocol before the switch.
Another case was:
TS -> STARTTLS
TC -> Ack "STARTTLS data"
220
TS -> Ack "220 data"
Client Hello
In IDS mode, this made a rule that wanted to look at content:"STARTTLS"
in combination with the protocol SMTP 'alert smtp ... content:"STARTTLS";'
impossible. By the time the content would match, the protocol was already
switched.
This patch fixes this case by creating a 'Detect/Log Flush' packet in
both directions. This will force final inspection and logging of the
pre-upgrade protocol (SMTP in this example) before doing the final
switch.
Set flags by default:
-Wmissing-prototypes
-Wmissing-declarations
-Wstrict-prototypes
-Wwrite-strings
-Wcast-align
-Wbad-function-cast
-Wformat-security
-Wno-format-nonliteral
-Wmissing-format-attribute
-funsigned-char
Fix minor compiler warnings for these new flags on gcc and clang.
Remove the 'StreamMsg' approach from the engine. In this approach the
stream engine would create a list of chunks for inspection by the
detection engine. There were several issues:
1. the messages had a fixed size, so blocks of data bigger than ~4k
would be cut into multiple messages
2. it led to lots of data copying and unnecessary memory use
3. the StreamMsgs used a central pool
The Stream engine switched over to the streaming buffer API, which
means that the reassembled data is always available. This made the
StreamMsg approach even clunkier.
The new approach exposes the streaming buffer data to the detection
engine. It has to pay attention to an important issue though: packet
loss. The data may have gaps. The streaming buffer API tracks the
blocks of continuous data.
To access the data for inspection a callback approach is used. The
'StreamReassembleRaw' function is called with a callback and data.
This way it runs the MPM and individual rule inspection code. At
the end of each detection run the stream engine is notified that it
can move forward its 'progress'.
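A sketch of what the callback-driven access could look like; the exact
StreamReassembleRaw signature and the callback arguments here are
assumptions. The point is only that detection receives continuous data
blocks plus their stream offset and reports its progress afterwards.

    /* sketch: inspect reassembled stream data via a callback */
    #include <stdint.h>

    typedef int (*StreamRawCallback)(void *cb_data, const uint8_t *data,
                                     uint32_t data_len, uint64_t offset);

    /* hypothetical: walks the continuous blocks in the streaming buffer,
     * calls 'cb' for each, and reports how far inspection got */
    int StreamReassembleRawSketch(void *ssn, StreamRawCallback cb,
                                  void *cb_data, uint64_t *progress);

    /* example callback: run MPM / per-rule inspection on one block */
    static int InspectBlock(void *cb_data, const uint8_t *data,
                            uint32_t data_len, uint64_t offset)
    {
        (void)cb_data; (void)data; (void)data_len; (void)offset;
        /* ... detection work on this continuous chunk ... */
        return 0;
    }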
Due to the use of AFL_LOOP and initialization/deinit outside of it,
part of the fuzzing relied on the global 'state' in flow and defrag.
Because of this, crashes that were found could not be reproduced. The
saved crash input was only the last in the series.
This patch addresses that. It requires a new output directory 'dump'
where the packet fuzzers will store all their input. If the AFL_LOOP
fails the files will not be removed and this 'serie' can be read
again for reproducing the issue.
e.g.: AFL would work with:
--afl-decoder-ppp=@@
and after a crash is found the produced serie can be read with:
--afl-decoder-ppp-serie=1486656919-514163
The series have a timestamp as name and a suffix that controls the
order in which the files will be 'replayed' in Suricata.
If the segments section in the yaml is omitted (the default) or when the
pool size is set to 'from_mtu', the size of the pool will be the MTU
minus 40. If the MTU couldn't be determined, it's assumed to be
1500, so the segment size for the pool will be 1460.
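The sizing rule is simple enough to show as a one-liner; the function name
is made up.

    /* sketch of the 'from_mtu' segment pool sizing */
    static unsigned int SegmentPoolSizeFromMtu(int mtu)
    {
        if (mtu <= 0)
            mtu = 1500;                 /* MTU unknown: assume 1500 */
        return (unsigned int)mtu - 40;  /* 1500 - 40 = 1460 byte segments */
    }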
Fix rate_filter issues: if action was modified it wouldn't be logged
in EVE. To address this pass the PacketAlert structure to the threshold
code so it can flag the PacketAlert as modified. Use this in logging.
Update the API to use const where possible. Fix a timeout issue that this
uncovered.
Add negated matches to match list instead of amatch.
Allow matching on 'failed'.
Introduce per packet flags for proto detection. Flags are used to
only inspect once per direction. Flag packet on PD-failure too.
Previously the detect and stream code lived in their own thread
modules. This meant profiling showed their cost as part of the
thread module profiling logic. Now that only the flow worker is
a thread module this no longer works.
This patch introduces profiling for the 3 current flow worker
steps: flow, stream, detect.
Instead of handling the packet update during flow lookup, handle
it in the stream/detect threads. This lowers the load of the
capture thread(s) in autofp mode.
The decoders now set a flag in the packet if the packet needs a
flow lookup. Then the workers will take care of this. The decoders
also already calculate the raw flow hash value. This is so that
this value can be used in flow balancing in autofp.
Because the flow lookup/creation is now done in the worker threads,
the flow balancing can no longer use the flow. It's not yet
available. Autofp load balancing uses raw hash values instead.
Along the same lines, move the UDP AppLayer out of the DecodeUDP module,
and also into the stream/detect threads.
Handle TCP session reuse inside the flow engine itself. If a looked up
flow matches the packet, but is a TCP stream starter, check if the
ssn needs to be reused. If that is the case handle it within the
lookup function. This simplifies the locking and removes potential race
conditions.
We want to add counters in order to track the number of times we hit a
decode event. A decode event is related to an error in the protocol
decoding over a certain packet.
This patch first modifies the decode-event list, reordering it in order
to separate single packet events from stream-related events and adding
the prefix "decoder" to decode events.
The counters are created during the decode setup and the corresponding event
counter is incremented every time a packet with the flag PKT_IS_INVALID is
finalized in the decode phase.
This adds a counter indicating how many times
the flow max memcap has been reached.
Since there is not always a reference to FlowManagerThreadData,
the counter is put in DecodeThreadVars.
Currently there is no counter increase in one call of FlowGetNew,
because we don't have tv or dtv at the time of the call.
The following is a snippet of the generated EVE entry:
"flow":{"memcap":0,"spare":10000,"emerg_mode_entered":0,"emerg_mode_over":0,"tcp_reuse":0,"memuse":7085248}
Put common counters on the first cache line. Place the flow output
pointer last, as its use depends on the flow logging being enabled
and even then it's only called very rarely.
Implement LINKTYPE_NULL for pcap live and pcap file.
From: http://www.tcpdump.org/linktypes.html
"BSD loopback encapsulation; the link layer header is a 4-byte field,
in host byte order, containing a PF_ value from socket.h for the
network-layer protocol of the packet.
Note that ``host byte order'' is the byte order of the machine on
which the packets are captured, and the PF_ values are for the OS
of the machine on which the packets are captured; if a live capture
is being done, ``host byte order'' is the byte order of the machine
capturing the packets, and the PF_ values are those of the OS of
the machine capturing the packets, but if a ``savefile'' is being
read, the byte order and PF_ values are not necessarily those of
the machine reading the capture file."
Feature ticket #1445
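A sketch of decoding that 4-byte header; the decoder prototypes are
hypothetical and, as the quote notes, the PF_ values and byte order are
those of the capturing machine, so a real implementation also has to cope
with byte-swapped values from foreign savefiles.

    /* sketch: BSD loopback (LINKTYPE_NULL) decoding */
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>            /* PF_INET, PF_INET6 */

    int DecodeIPv4Sketch(const uint8_t *pkt, uint32_t len);   /* hypothetical */
    int DecodeIPv6Sketch(const uint8_t *pkt, uint32_t len);   /* hypothetical */

    int DecodeNullSketch(const uint8_t *pkt, uint32_t len)
    {
        if (pkt == NULL || len < 4)
            return -1;
        uint32_t family;
        memcpy(&family, pkt, sizeof(family));   /* host byte order */
        switch (family) {
            case PF_INET:
                return DecodeIPv4Sketch(pkt + 4, len - 4);
            case PF_INET6:                      /* value differs per OS */
                return DecodeIPv6Sketch(pkt + 4, len - 4);
            default:
                return -1;
        }
    }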
Set actions that are set directly from Signatures using the new
utility function DetectSignatureApplyActions. This will apply
the actions and also store info about the 'drop' that first made
the rule drop.
In flow timeout handling we need a function that allocates and blanks
a place that will be used to put constructed packet data. This new
function has no other goal.
The field ext_pkt was cleaned before calling the release function.
The result was that IPS modes such as AF_PACKET's were no longer
working, because they were not able to send the data that was
initially pointed to by ext_pkt.
This patch moves the ext_pkt cleaning to the cleaning macro. This
ensures that the cleaning is done for allocated and pool packets.
Most flows are marked for clean up by the flow manager, which then
passes them to the recycler. The recycler logs and cleans up. However,
under resource stress conditions, the packet threads can recycle
an existing flow directly. So here the recycler has no role to play, as
the flow is immediately used.
For this reason, the packet threads need to be able to invoke the
flow logger directly.
The flow logging thread ctx will be stored in the DecodeThreadVars
structure. Therefore, this patch makes the DecodeThreadVars an argument
to FlowHandlePacket.
Using a stack for free Packet storage causes recently freed Packets to be
reused quickly, while there is more likelihood of the data still being in
cache.
The new structure has a per-thread private stack for allocating Packets
which does not need any locking. Since Packets can be freed by any thread,
there is a second stack (return stack) for freeing packets by other threads.
The return stack is protected by a mutex. Packets are moved from the return
stack to the private stack when the private stack is empty.
Returning packets back to their "home" stack keeps the stacks from getting out
of balance.
The PacketPoolInit() function is now called by each thread that will be
allocating packets. Each thread allocates max_pending_packets, which is a
change from before, where that was the total number of packets across all
threads.
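A sketch of the two-stack arrangement; the names and fields are
illustrative, not the actual PacketPool implementation.

    /* sketch: per-thread packet pool with a locked return stack */
    #include <pthread.h>
    #include <stddef.h>

    typedef struct PktSketch_ {
        struct PktSketch_ *next;
        /* ... packet data ... */
    } PktSketch;

    typedef struct PktPoolSketch_ {
        pthread_t owner;              /* thread this pool belongs to */
        PktSketch *head;              /* private stack: no locking */
        PktSketch *return_head;       /* other threads free onto this */
        pthread_mutex_t return_m;
    } PktPoolSketch;

    static PktSketch *PacketPoolGetSketch(PktPoolSketch *pool)
    {
        if (pool->head == NULL) {
            /* private stack empty: take the whole return stack at once */
            pthread_mutex_lock(&pool->return_m);
            pool->head = pool->return_head;
            pool->return_head = NULL;
            pthread_mutex_unlock(&pool->return_m);
        }
        PktSketch *p = pool->head;
        if (p != NULL)
            pool->head = p->next;
        return p;
    }

    /* packets always go back to their "home" pool */
    static void PacketPoolReturnSketch(PktPoolSketch *home, PktSketch *p)
    {
        if (pthread_equal(pthread_self(), home->owner)) {
            p->next = home->head;     /* own pool: lock free */
            home->head = p;
        } else {
            pthread_mutex_lock(&home->return_m);
            p->next = home->return_head;
            home->return_head = p;
            pthread_mutex_unlock(&home->return_m);
        }
    }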
For packets that were freed, not recycled, profiling memory wasn't
freed:
==15745== 13,312 bytes in 8 blocks are definitely lost in loss record 611 of 615
==15745== at 0x4C2C494: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15745== by 0xA190D5: SCProfilePacketStart (util-profiling.c:963)
==15745== by 0x4E4345: PacketGetFromAlloc (decode.c:134)
==15745== by 0x83FE75: FlowForceReassemblyPseudoPacketGet (flow-timeout.c:276)
==15745== by 0x8413BF: FlowForceReassemblyForHash (flow-timeout.c:588)
==15745== by 0x841897: FlowForceReassembly (flow-timeout.c:716)
==15745== by 0x9540F6: main (suricata.c:2296)
==15745==
==15745== 14,976 bytes in 9 blocks are definitely lost in loss record 612 of 615
==15745== at 0x4C2C494: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==15745== by 0xA190D5: SCProfilePacketStart (util-profiling.c:963)
==15745== by 0x4E4345: PacketGetFromAlloc (decode.c:134)
==15745== by 0x83FE75: FlowForceReassemblyPseudoPacketGet (flow-timeout.c:276)
==15745== by 0x841508: FlowForceReassemblyForHash (flow-timeout.c:620)
==15745== by 0x841897: FlowForceReassembly (flow-timeout.c:716)
==15745== by 0x9540F6: main (suricata.c:2296)
This patch addresses that.
This patch introduces a new counter "decoder.vlan_qinq". It counts
packets that have more than one stacked vlan layer.
Packets with 2 vlan layers will increment both "decoder.vlan" and
"decoder.vlan_qinq".
Instead of a large (6k+) structure in the Packet, make the profiling
storage dynamic. To do this the Packet->profile is now a pointer.
Initial support for selective sampling, e.g. only profile every
1000th packet.
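Selective sampling can be as simple as the sketch below; the rate and
counter handling are illustrative, not the actual profiling code.

    /* sketch: only profile every Nth packet */
    #include <stdbool.h>
    #include <stdint.h>

    static uint64_t profile_sample_rate = 1000;   /* 1 in 1000 packets */
    static uint64_t profile_pkt_cnt = 0;

    static bool ProfileThisPacket(void)
    {
        /* only sampled packets get a Packet->profile allocation */
        return (profile_pkt_cnt++ % profile_sample_rate) == 0;
    }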
Move app layer event handling into app-layer-event.[ch].
Convert the 'Set' macros to functions.
Get rid of duplication in Set and SetRaw. Set now calls SetRaw.
Fix a potential int overflow condition in the event storage.
Update callers.