The new Hyperscan 4.4 API provides a function to check for SSSE3
presence at runtime. This allows us to fall back to non-Hyperscan
matchers on systems without SSSE3 even when the suricata executable
is built with Hyperscan support. Addresses Redmine issue #2010.
Signed-off-by: Sascha Steinbiss <sascha@steinbiss.name>
Tested-by: Arturo Borrero Gonzalez <arturo@debian.org>
In preparation of the introduction of more general purpose prefilter
engines, rename PatternMatcherQueue to PrefilterRuleStore. The new
engines will fill this structure a similar way to the current mpm
prefilters.
So far, the patterns as passed to the mpm's would use global id's that
were shared among all buffers, directions. This would lead to a fairly
large pattern id space. As the mpm algo's use the pattern id's to
prevent duplicate matching through a pattern id based bitarray,
shrinking this space will optimize performance.
This patch implements this. It sets a flag before adding the pattern
to the mpm ctx, instructing the mpm to ignore the provided pid and
handle pids management itself. This leads to a shrinking of the
bitarray size.
This is made possible by the previous work that removes the pid logic
from the code.
Next to this, this patch moves the pattern setup stage to common util
functions. This avoids code duplication.
Update ac, ac-bs and ac-ks to use this.
Use an MPM specific pattern index, which is simply an index starting
at zero and incremented for each pattern added to the MPM, rather than
the externally provided Pattern ID (pid), since that can be much
larger than the number of patterns. The Pattern ID is shared across at
MPMs. For example, an MPM with one pattern with pid=8000 would result
in a max_pid of 8000, so the pid_pat_list would have 8000 entries.
The pid_pat_list[] is replaced by a array of pattern indexes. The PID is
moved to the SCACTilePatternList as a single value. The PatternList is
also indexed by the Pattern Index.
max_pat_id is no longer needed and mpm_ctx->pattern_cnt is used instead.
The local bitarray is then also indexed by pattern index instead of PID, making
it much smaller. The local bit array sets a bit for each pattern found
for this MPM. It is only kept during one MPM search (stack allocated).
One note, the local bit array is checked first and if the pattern has already
been found, it will stop checking, but count a match. This could result in
over counting matches of case-sensitve matches, since following case-insensitive
matches will also be counted. For example, finding "Foo" in "foo Foo foo" would
report finding "Foo" 2 times, mis-counting the third word as "Foo".
In PmqMerge() use MpmAddSids() instead of blindly copying the src
rule list onto the end of the dst rule list, since there might not
be enough room in the dst list. MpmAddSids() will resize the dst array
if needed.
Also add code to MpmAddSids() MpmAddPid() to better handle the case
that realloc fails to get more space. It first tries 2x the needed
space, but if that fails, it tries for just 1x. If that fails resize
returns 0. For MpmAddPid(), if resize fails, the new pid is lost. For
MpmAddSids(), as many SIDs as will fit are added, but some will be
lost.
Rather than creating the array of size maxpatid, dynamically resize as needed.
This also handles the case where duplicate pid are added to the array.
Also fix error in bitarray allocation (local version) to always use bitarray_size.
Rather than statically allocate 64K entries in every rule_id_array,
increase the size only when needed. Created a new function MpmAddSids()
to check the size before adding the new sids. If the array is not large
enough, it calls MpmAddSidsResize() that calls realloc and does error
checking. If the realloc fails, it prints an error and drops the new sids
on the floor, which seems better than exiting Suricata.
The size is increased to (current_size + new_count) * 2. This handles the
case where new_count > current_size, which would not be handled by simply
using current_size * 2. It should also be faster than simply reallocing to
current_size + new_count, which would then require another realloc for each
new addition.
Pmq add rule list: Array of uint32_t's to store (internal) sids from the MPM.
AC: store sids in the pattern list, append to Pmq::rule_id_array on match.
Detect: sort rule_id_array after it was set up by the MPM. Rule id's
(Signature::num) are ordered, and the rule's with the lowest id are to
be inspected first. As the MPM doesn't fill the array in order, but instead
'randomly' we need this sort step to assure proper inspection order.
app-layer.[ch], app-layer-detect-proto.[ch] and app-layer-parser.[ch].
Things addressed in this commit:
- Brings out a proper separation between protocol detection phase and the
parser phase.
- The dns app layer now is registered such that we don't use "dnstcp" and
"dnsudp" in the rules. A user who previously wrote a rule like this -
"alert dnstcp....." or
"alert dnsudp....."
would now have to use,
alert dns (ipproto:tcp;) or
alert udp (app-layer-protocol:dns;) or
alert ip (ipproto:udp; app-layer-protocol:dns;)
The same rules extend to other another such protocol, dcerpc.
- The app layer parser api now takes in the ipproto while registering
callbacks.
- The app inspection/detection engine also takes an ipproto.
- All app layer parser functions now take direction as STREAM_TOSERVER or
STREAM_TOCLIENT, as opposed to 0 or 1, which was taken by some of the
functions.
- FlowInitialize() and FlowRecycle() now resets proto to 0. This is
needed by unittests, which would try to clean the flow, and that would
call the api, AppLayerParserCleanupParserState(), which would try to
clean the app state, but the app layer now needs an ipproto to figure
out which api to internally call to clean the state, and if the ipproto
is 0, it would return without trying to clean the state.
- A lot of unittests are now updated where if they are using a flow and
they need to use the app layer, we would set a flow ipproto.
- The "app-layer" section in the yaml conf has also been updated as well.
Aho-Corasick mpm optimized for Tilera Tile-Gx architecture. Based on the
util-mpm-ac.c code base. The primary optimizations are:
1) Matching function used Tilera specific instructions.
2) Alphabet compression to reduce delta table size to increase cache
utilization and performance.
The basic observation is that not all 256 ASCII characters are used by
the set of multiple patterns in a group for which a DFA is
created. The first reason is that Suricata's pattern matching is
case-insensitive, so all uppercase characters are converted to
lowercase, leaving a hole of 26 characters in the
alphabet. Previously, this hole was simply left in the middle of the
alphabet and thus in the generated Next State (delta) tables.
A new, smaller, alphabet is created using a translation table of 256
bytes per mpm group. Previously, there was one global translation
table for converting upper case to lowercase.
Additional, unused characters are found by creating a histogram of all
the characters in all the patterns. Then all the characters with zero
counts are mapped to one character (0) in the new alphabet. Since
These characters appear in no pattern, they can all be mapped to a
single character and still result in the same matches being
found. Zero was chosen for the value in the new alphabet since this
"character" is more likely to appear in the input. The unused
character always results in the next state being state zero, but that
fact is not currently used by the code, since special casing takes
additional instructions.
The characters that do appear in some pattern are mapped to
consecutive characters in the new alphabet, starting at 1. This
results in a dense packing of next state values in the delta tables
and additionally can allow for a smaller number of columns in that
table, thus using less memory and better packing into the cache. The
size of the new alphabet is the number of used characters plus 1 for
the unused catch-all character.
The alphabet size is rounded up to the next larger power-of-2 so that
multiplication by the alphabet size can be done with a shift. It
might be possible to use a multiply instruction, so that the exact
alphabet size could be used, which would further reduce the size of
the delta tables, increase cache density and not require the
specialized search functions. The multiply would likely add 1 cycle to
the inner search loop.
Since the multiply by alphabet-size is cleverly merged with a mask
instruction (in the SINDEX macro), specialized versions of the
SCACSearch function are generated for alphabet sizes 256, 128, 64, 32
and 16. This is done by including the file util-mpm-ac-small.c
multiple times with a redefined SINDEX macro. A function pointer is
then stored in the mpm context for the search function. For alpha bit
sizes of 8 or smaller, the number of states usually small, so the DFA
is already very small, so there is little difference using the 16
state search function.
The SCACSearch function is also specialized by the size of the value
stored in the next state (delta) tables, either 16-bits or 32-bits.
This removes a conditional inside the Search function. That
conditional is only called once, but doesn't hurt to remove
it. 16-bits are used for up to 32K states, with the sign bit set for
states with matches.
Future optimization:
The state-has-match values is only needed per state, not per next
state, so checking the next-state sign bit could be replaced with
reading a different value, at the cost of an additional load, but
increasing the 16-bit next state span to 64K.
Since the order of the characters in the new alphabet doesn't matter,
the new alphabet could be sorted by the frequency of the characters in
the expected input stream for that multi-pattern matcher. This would
group more frequent characters into the same cache lines, thus
increasing the probability of reusing a cache-line.
All the next state values for each state live in their own set of
cache-lines. With power-of-two sizes alphabets, these don't overlap.
So either 32 or 16 character's next states are loaded in each cache
line load. If the alphabet size is not an exact power-of-2, then the
last cache-line is not completely full and up to 31*2 bytes of that
line could be wasted per state.
The next state table could be transposed, so that all the next states
for a specific character are stored sequentially, this could be better
if some characters, for example the unused character, are much more
frequent.