    An Architecture for Generating Semantic-Aware Signatures
    Vinod Yegneswaran, Jonathon T. Giffin, Paul Barford, Somesh Jha
    Abstract
    Identifying new intrusion exploits and developing effective detection signatures for them is essential
    for protecting computer networks. We present Nemean, a system for automatic generation of intrusion
    signatures from honeynet packet traces. Our architecture is distinguished by its emphasis on a modular
    design framework that encourages independent development and modification of system components and
    on protocol semantic awareness, which allows construction of signatures that greatly reduce false alarms.
    The building blocks of our architecture include transport and service normalization, intrusion profile
    clustering, and automata learning that generates connection- and session-aware signatures. We evaluate
    our architecture through a prototype implementation that demonstrates the potential of semantic-aware,
    resilient signatures. For example, signatures generated by Nemean for NetBIOS exploits had a 0%
    false-positive rate and a 0.04% false-negative rate.
    1 Introduction
    Computer network security is a multidimensional activity that continues to grow in importance. The
    prevalence of attacks in the Internet and the ability of self-propagating worms to infect millions of Internet
    hosts have been well documented. Developing techniques and tools that enable more precise and more
    rapid detection of such attacks presents significant challenges to both the research and operational
    communities.
    Network security architectures often include network intrusion detection systems (NIDS) that monitor
    packet traffic between networks and raise alarms when malicious activity is observed. NIDS that employ
    misuse-detection compare traffic against a hand-built database of signatures or patterns that identify
    previously documented attack profiles. While the effectiveness of a misuse-detector is linked tightly to the
    quality of its signature database, competing requirements make generating and maintaining NIDS signatures
    difficult. On one hand, signatures should be specific: they should only identify the characteristics of specific
    attack profiles. A lack of specificity leads to false alarms, one of the major problems for NIDS today. For
    example, Sommer and Paxson argue that including context, such as the victim’s response, in NIDS signatures
    may reduce false alarm rates. On the other hand, signatures should be general so that they match variants
    of specific attack profiles. For example, a signature that does not account for transport or application level
    semantics can lead to false alarms. Thus, a balance between specificity and generality is an important
    objective for signatures.
    We present the design and implementation of an architecture called Nemean for automatic generation of
    signatures for misuse-detection. Nemean aims to create signatures that result in lower false alarm rates by
    balancing specificity and generality. We achieve this balance by including semantic awareness, or the ability
    to understand session-layer and application-layer protocol semantics. Examples of session layer protocols
    include NetBIOS and RPC, and application layer protocols include SMB, TELNET, NCP and HTTP.
    Increasingly, pre-processors for these protocols have become integral parts of NIDS. We argue that these
    capabilities are essential for automatic signature generation systems for the following reasons:
    1. Semantic awareness enables signatures to be generated for attacks in which the exploit is a small part
    of the entire payload.
    2. Semantic awareness enables signatures to be generated for multi-step attacks in which the exploit does
    not occur until the last step.
    3. Semantic awareness allows weights to be assigned to different portions of the payload (e.g., timestamps,
    sequence numbers, or proxy-cache headers) based upon their significance.
    4. Semantic awareness helps produce generalized signatures from a small number of input samples.
    5. Semantic awareness results in signatures that are easy to understand and validate.
    1
    The first labor of the Greek hero Heracles was to rid the Nemean plain of a fierce creature known as the Nemean Lion. After
    slaying the beast, Heracles wore its pelt as impenetrable armor in his future labors.
    Our architecture contains two components: a data abstraction component that normalizes packets from
    individual sessions and renders semantic context, and a signature generation component that groups similar
    sessions and uses machine-learning techniques to generate signatures for each cluster. The signatures produced
    are suitable for deployment in a NIDS. We address specificity by producing both connection-level
    and session-level signatures. We address generality by learning signatures from transport-normalized data
    and by considering application-level semantics, which enables variants of attacks to be detected. Therefore, we
    argue that Nemean generates balanced signatures.
    The input to Nemean is a set of packet traces collected from a honeynet deployed on unused IP address
    space. Any data observed at a honeynet is anomalous, thus eliminating both the problem of privacy and
    the problem of separating malicious and normal traffic. We assume that the honeynet is subject to the same
    attack traffic as standard hosts and discuss the ramifications of this assumption in Section
    To evaluate Nemean’s architecture, we developed a prototype implementation of each component. This
    implementation enables automated generation of signatures from honeynet packet traces. We also developed
    a simple alert generation tool for off-line analysis which compares packet traces against signatures. While
    we demonstrate that our current implementation is extremely effective, the modular design of the architecture
    enables any of the individual components to be easily replaced. We expect that further developments will
    tune and expand individual components resulting in more timely, precise and effective signatures. From
    a broader perspective, we believe that our results demonstrate the importance of Nemean’s capability in a
    comprehensive security architecture. Section
    describes the architecture and Sections
    and
    present our
    prototype implementation of Nemean.
    We performed two evaluations of our prototype. First, we calculated detection and misdiagnosis counts
    using packet traces collected at two unused /19 address ranges (16K total IP addresses) from two distinct Class
    B networks allocated to our campus. We collected session-level data for exploits targeting ports 80 (HTTP),
    139 and 445 (NetBIOS/SMB). Section
    describes the data collection environment. We use this packet trace
    data as input to Nemean to produce a comprehensive signature set for the three target ports. In Section
    we
    describe the major clusters and the signatures produced from this data set. Leave-out testing results indicate
    that our system generates accurate signatures for most common intrusions, including Code Red, Nimda, and
    other popular exploits. We detected 100% of the HTTP exploits and 99.96% of the NetBIOS exploits with 0
    misdiagnoses. Next, we validated our signatures by testing for false alarms using packet traces of all HTTP
    traffic collected from our department’s border router. Nemean produced 0 false alarms for this data set. By
    comparison, Snort generated over 80,000 false alarms on the same data set. These results suggest that even
    with a much smaller signature set, Nemean achieves detectability rates on par with Snort while identifying
    attacks with superior precision and far fewer false alarms.
    2 Related Work
    Sommer and Paxson propose adding connection-level context to signatures to reduce false positives
    in misuse-detection. Handley et al. describe transport-level evasion techniques designed to elude a
    NIDS as well as normalization methods that disambiguate data before comparison against a signature. Similar
    work describes common HTTP evasion techniques and standard URL morphing attacks. Vigna et al.
    describe several mutations and demonstrate that two widely deployed misuse-detectors are susceptible to such
    mutations. The work of Handley et al. and Vigna et al. highlights the importance of incorporating semantics
    into the signature-generation process.
    Honeypots are an excellent source of data for intrusion and attack analysis. Levin et al. describe how
    honeypots extract details of worm exploits that can be analyzed to generate detection signatures. Their
    signatures must be generated manually.
    Several automated signature generation systems have been proposed. Figure 1 summarizes the differences
    between Nemean and the other signature-generation systems. One of the first systems proposed was
    Honeycomb, developed by Kreibich and Crowcroft. Like Nemean, Honeycomb generates signatures from
    traffic observed at a honeypot and is implemented as a Honeyd plugin. At the heart of Honeycomb is the
    2
    A honeynet is a network of high-interaction honeypots.
    3
    A negligible amount of non-malicious traffic on our honeynet is caused by misconfigurations and is easily separated from the
    malicious traffic.
    4
    Honeyd is a popular open-source low-interaction honeypot tool that simulates virtual machines over unused IP address space.
    System      Traffic source   Generates contextual signatures                             Semantic aware   Signature generation algorithm           Target attack class
    Nemean      Honeypots        Yes (generates connection- and session-level signatures)   Yes              (MSG) Clustering and automata learning   General
    Autograph   DMZ              No (generates byte-level signatures)                        No               (COPP) Partitioning content blocks       Worm
    EarlyBird   DMZ              No (generates byte-level signatures)                        No               Measuring packet-content prevalence      Worm
    Honeycomb   Honeypots        No (generates byte-level signatures)                        No               Pairwise LCS across connections          General
    Figure 1: Comparison of Nemean to other signature-generation systems.
    longest common substring (LCS) algorithm that looks for the longest shared byte sequences across pairs of
    connections. However, since Honeycomb does not consider protocol semantics, its pairwise LCS algorithm
    outputs a large number of signatures. It is also frequently distracted by long irrelevant byte sequences in
    packet payloads, thus reducing its capability for identifying attacks with small exploit strings, exemplified in
    protocols such as NetBIOS. We discuss this in greater detail in Section
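To make the pairwise comparison concrete, the following is a minimal sketch of a longest-common-substring computation over two payloads. It uses a plain dynamic-programming table for readability and is not taken from Honeycomb's implementation; the function name and the sample payloads are hypothetical.

```python
def longest_common_substring(a: bytes, b: bytes) -> bytes:
    """Return one longest byte string that occurs contiguously in both inputs."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)          # match lengths for the previous row of the table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common substring that ended at (i-1, j-1).
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Two hypothetical request lines that share an exploit string.
p1 = b"GET /scripts/..%255c../cmd.exe HTTP/1.0"
p2 = b"GET /msadc/..%255c../cmd.exe HTTP/1.0"
shared = longest_common_substring(p1, p2)
```

Because the comparison is purely byte-level, the result depends on whichever shared run happens to be longest, regardless of whether that run is relevant to the exploit.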
    Kim and Karp describe the Autograph system for automated generation of signatures to detect worms.
    Unlike Honeycomb and Nemean, Autograph takes as input packet traces from a DMZ that includes benign traffic.
    Content blocks that match “enough” suspicious flows are used as input to COPP, an algorithm based on Rabin
    fingerprints that searches for repeated byte sequences by partitioning the payload into content blocks. Like
    Honeycomb, Autograph does not consider protocol semantics. We argue that such approaches, while attractive
    in principle, seem viable for a rather limited spectrum of observed attacks and are prone to false positives. This
    also makes Autograph more susceptible to mutation attacks. Finally, unlike Autograph, which produces
    byte-level signatures, Nemean can produce both connection-level and session-level signatures.
    Another system developed to generate signatures for worms, Earlybird, measures packet-content
    prevalence at a single monitoring point such as a network DMZ. By counting the number of distinct sources
    and destinations associated with strings that repeat often in the payload, Earlybird distinguishes benign repe-
    titions from epidemic content. Like Autograph, Earlybird also produces byte-level signatures and is not aware
    of protocol semantics. Hence Earlybird has the same disadvantages compared to Nemean as Autograph.
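The prevalence idea described above can be sketched as follows. This is an illustrative approximation only: the fixed substring length `k`, the three thresholds, and the `(src, dst, payload)` tuple layout are all assumptions introduced for the example, and the sketch stores substrings directly rather than using the compact data structures a line-rate system would need.

```python
from collections import defaultdict


def prevalent_content(packets, k=8, min_count=3, min_sources=2, min_dests=2):
    """Return k-byte substrings that repeat often AND are spread across hosts.

    packets: iterable of (src, dst, payload) tuples (hypothetical layout).
    """
    count = defaultdict(int)
    sources = defaultdict(set)
    dests = defaultdict(set)
    for src, dst, payload in packets:
        # Count each distinct substring once per packet.
        grams = {payload[i:i + k] for i in range(len(payload) - k + 1)}
        for gram in grams:
            count[gram] += 1
            sources[gram].add(src)
            dests[gram].add(dst)
    return {
        gram
        for gram in count
        if count[gram] >= min_count
        and len(sources[gram]) >= min_sources
        and len(dests[gram]) >= min_dests
    }


trace = [
    ("10.0.0.1", "10.1.0.1", b"aaEXPLOIT!"),
    ("10.0.0.2", "10.1.0.2", b"bbbEXPLOIT!"),
    ("10.0.0.3", "10.1.0.3", b"EXPLOIT!cc"),
    # Repeated content between a single host pair is treated as benign repetition.
    ("10.0.0.9", "10.1.0.9", b"BENIGN!!"),
    ("10.0.0.9", "10.1.0.9", b"BENIGN!!"),
    ("10.0.0.9", "10.1.0.9", b"BENIGN!!"),
]
flagged = prevalent_content(trace)
```

The address-dispersion test is what separates the two repeated strings in the sample trace: both repeat three times, but only one is seen across several sources and destinations.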
    Pouget and Dacier analyze honeypot traffic to identify root causes of frequent processes observed
    in a honeypot environment. They first organize the observed traffic based on the port sequence. Then, the
    data is clustered using association-rules mining. The resulting clusters are further refined using “phrase
    distance” (which is similar to the hierarchical edit distance metric described in Section
    between attack
    payloads. Pouget and Dacier’s technique is not semantically aware. Julisch also clusters alarms for the
    purpose of discovering the root cause of an alarm. After clustering the alarms, Julisch’s technique generates
    a generalized alarm for each cluster. Intuitively, generation of generalized alarms is similar to the
    automata-learning step of our algorithm. However, the goals and techniques used in our work are different
    from those used by Julisch.
    Anomaly detection is an alternative approach for malicious traffic identification in a NIDS. Anomaly
    detectors construct a model of acceptable behavior and then flag any deviations from the model as suspicious.
    Anomaly-detection techniques for detecting port scans have also been explored. Balancing specificity
    and generality has proven extraordinarily difficult in anomaly detection, and such systems often produce high
    rates of false alarms. This paper focuses on misuse-detection, and we will not investigate anomaly-detecting
    techniques further.
    3 Nemean Architecture
    As shown in Figure 2, Nemean’s architecture is divided into two components: the data abstraction
    component and the signature generation component. The input to Nemean is a packet trace collected from a
    honeynet. Even when deployed on a small address space (e.g., a /24 containing 256 IP addresses), a honeynet
    can provide a large volume of data without significant privacy or false-positive concerns.
    [Figure: Data Abstraction Component (packet trace, transport normalization, flow aggregation, service
    normalization with per-service specification, semi-structured session trees) and Signature Generation
    Component (connection clustering, session clustering, automata learning with generalization rules,
    connection or session signatures).]
    Figure 2: Components and data flow description of the Nemean architecture
    3.1 Data Abstraction Component
    The Data Abstraction Component (DAC) aggregates and transforms the packet trace into a well-defined data
    structure suitable for clustering by a generic clustering module without specific knowledge of the transport
    protocol or application-level semantics. We call these aggregation units semi-structured session trees (SSTs).
    The components of the DAC can then be thought of in terms of the data flow through the module as shown
    in Figure 2. While we built our own DAC module, in principle it could be implemented as an extension to a
    standard NIDS, such as a Bro policy script.
    Transport normalization disambiguates obfuscations at the network and transport layers of the protocol
    stack. Our DAC reads packet traces through the libpcap library, so it can run either online or offline on
    tcpdump traces. This step considers transport-specific obfuscations like fragmentation reassembly, duplicate
    suppression, and checksums. We describe these in greater detail in Section
    The aggregation step groups packet data between two hosts into sessions. The normalized packet data is
    first composed and stored as flows. Periodically, the DAC expires flows and converts them into connections. A
    flow might be expired for two reasons: a new connection is initiated between the same pair of hosts and ports
    or the flow has been inactive for a time period greater than a user-defined timeout (1 hour in our experimental
    setup). Flows are composed of packets, but connections are composed of request-response elements. Each
    connection is stored as part of a session. A session is a sequence of connections between the same host pairs.
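The two expiry rules can be sketched as below. This is a simplified illustration under stated assumptions, not the prototype's code: the `Packet` type and its fields are hypothetical, only packets sent by the remote source are tracked, a SYN flag stands in for "a new connection is initiated", and the conversion of packets into request-response elements is omitted.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical, already transport-normalized packet record."""
    ts: float
    src: str
    sport: int
    dst: str
    dport: int
    syn: bool


def aggregate(packets, timeout=3600.0):
    """Group packets into flows, expire flows into connections, and
    collect connections into sessions keyed by (src, dst) host pair."""
    flows = {}                      # (src, sport, dst, dport) -> packets of the active flow
    sessions = defaultdict(list)    # (src, dst) -> list of expired flows (connections)

    def expire(key):
        sessions[(key[0], key[2])].append(flows.pop(key))

    for pkt in sorted(packets, key=lambda p: p.ts):
        # Rule 2: expire flows that have been idle longer than the timeout.
        for key in [k for k, f in flows.items() if pkt.ts - f[-1].ts > timeout]:
            expire(key)
        key = (pkt.src, pkt.sport, pkt.dst, pkt.dport)
        # Rule 1: a new connection between the same hosts and ports ends the old flow.
        if key in flows and pkt.syn:
            expire(key)
        flows.setdefault(key, []).append(pkt)

    for key in list(flows):         # flush whatever is still active at end of trace
        expire(key)
    return dict(sessions)


trace = [
    Packet(0.0, "A", 1000, "H", 80, True),
    Packet(1.0, "A", 1000, "H", 80, False),
    Packet(10.0, "A", 1000, "H", 80, True),     # same hosts and ports, new connection
    Packet(5000.0, "A", 1001, "H", 80, True),   # arrives after the idle timeout
]
sessions = aggregate(trace)
```

In this sample trace all three connections land in the single session for the host pair `("A", "H")`.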
    Service-specific information in sessions must be normalized before clustering for two reasons. First,
    classification of sessions becomes more robust and clustering algorithms can be independent of the type of
    service. Second, the space of ambiguities is too large to produce a signature for every possible encoding of
    attacks. By decoding service-specific information into a canonical form, normalization enables generation of
    a more compact signature set. A detection system must then first decode attack payloads before signature
    matching. This strategy is consistent with that employed by popular NIDS. We describe the particular
    normalizations performed in greater detail in Section
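As an illustration of decoding into a canonical form, the sketch below normalizes an HTTP request path. The specific rules shown (repeated percent-decoding, backslash-to-slash conversion, collapsing duplicate slashes) are assumptions chosen for the example and are not a list of the normalizations Nemean performs; `normalize_url_path` is a hypothetical helper.

```python
import re
from urllib.parse import unquote


def normalize_url_path(raw: str, max_rounds: int = 5) -> str:
    """Reduce differently encoded spellings of a path to one canonical string."""
    path = raw
    # Decode repeatedly so that double encodings such as %255c are unwrapped.
    for _ in range(max_rounds):
        decoded = unquote(path)
        if decoded == path:
            break
        path = decoded
    path = path.replace("\\", "/")        # treat backslashes as path separators
    path = re.sub(r"/{2,}", "/", path)    # collapse runs of slashes
    return path


single = normalize_url_path("/scripts/..%5c../winnt/system32/cmd.exe")
double = normalize_url_path("/scripts/..%255c../winnt/system32/cmd.exe")
```

Both encodings collapse to the same string, so one signature over the canonical form can cover both variants.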
    The DAC finally transforms the normalized sessions into XML-encoded SSTs suitable for input to the
    clustering module. This step also assigns weights to the elements of the SST to highlight the most important
    attributes, like the URL in an HTTP request, and deemphasize the less important attributes, such as encrypted
    fields and proxy-cache headers in HTTP packets. The clustering module may use the weights to construct
    more accurate session classifications.
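A weighted, XML-encoded session tree might look like the sketch below. The element names, attribute names, and weight values are hypothetical; the paper's actual SST schema is not shown in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical weights: emphasize the URL, de-emphasize generic headers.
WEIGHTS = {"url": 1.0, "method": 0.5, "header": 0.1}
DEFAULT_WEIGHT = 0.1


def build_sst(session_id, connections):
    """Build a session -> connection -> request tree with a weight on every field.

    connections: list of connections, each a list of {field_name: value} requests.
    """
    root = ET.Element("session", id=session_id)
    for conn_id, requests in enumerate(connections):
        conn = ET.SubElement(root, "connection", id=str(conn_id))
        for request in requests:
            node = ET.SubElement(conn, "request")
            for field, value in request.items():
                weight = WEIGHTS.get(field, DEFAULT_WEIGHT)
                child = ET.SubElement(node, field, weight=str(weight))
                child.text = value
    return root


sst = build_sst("s1", [[{"method": "GET", "url": "/default.ida", "header": "Via: proxy"}]])
xml_text = ET.tostring(sst, encoding="unicode")
```

A clustering module could then read the `weight` attribute when comparing two trees, so that a mismatch in a proxy header costs less than a mismatch in the URL.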
    3.2 Signature Generation Component
    The clustering module groups sessions and connections with similar attack profiles according to a similarity
    metric. We assume that sessions grouped together will correspond to a single attack type or variants of a well-
    known attack while disparate clusters represent distinct attacks or attack variants that differ significantly from
    some original attack. Effective clustering requires two properties of the attack data. First, data that correspond
    to an attack and its variants should be measurably similar. A clustering algorithm can then classify such data
    as likely belonging to the same attack. Second, data corresponding to different attacks must be measurably
    dissimilar so that a clustering algorithm can separate such data. We believe that the two required properties are
    unlikely to hold for data sets that include significant quantities of non-malicious or normal traffic. Properties
      of normal traffic vary so greatly as to make effective clustering difficult without additional discrimination
    metrics. Conversely, malicious data contains identifiable structure even in the presence of obfuscation and
    limited polymorphism. Nemean’s use of honeynet data enables a reasonable number of meaningful clusters
    to be produced. While each cluster ideally contains the set of sessions or connections for some attack, we also
    presume that this data will contain minor obfuscations, particularly in the sequential structure of the data, that
    correspond to an attacker’s attempts to evade detection. These variations provide the basis for our signature
    generation step.
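The role of the similarity metric can be illustrated with a deliberately simple greedy clustering pass. This is a stand-in for exposition, not the clustering algorithm used by Nemean: the string similarity from Python's `difflib`, the threshold value, and the choice of the first member as cluster representative are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def cluster(items, threshold=0.7):
    """Assign each item to the first cluster whose representative is similar enough."""
    clusters = []
    for item in items:
        for members in clusters:
            if similarity(item, members[0]) >= threshold:
                members.append(item)
                break
        else:
            clusters.append([item])   # no close cluster: start a new one
    return clusters


requests = [
    "GET /default.ida?NNNN",
    "GET /default.ida?XXXX",            # variant of the first request
    "OPTIONS / HTTP/1.1 PROPFIND",      # unrelated request
]
groups = cluster(requests)
```

The sketch relies on the two properties stated above: variants score above the threshold against their representative, and unrelated requests score well below it.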
    The automata learning module constructs an attack signature from a cluster of sessions. A generator is
    implemented for a target intrusion detection system and produces signatures suitable for use in that system.
    This component has the ability to generate highly expressive signatures for advanced systems, such as regular
    expression signatures with session-level context that are suitable for Bro. Clusters that contain many
    non-uniform sessions are of particular interest. These differences may indicate either the use of obfuscation
    transformations to modify an attack or a change made to an existing attack to produce a new variant. Our
    signature generation component generalizes these transformations to produce a signature that is resilient to
    evasion attempts. Generalizations enable signatures to match malicious sequences that were not observed in
    the training set.
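One very small example of such a generalization is to treat any run of a repeated request as "one or more" of that request, which corresponds to a chain automaton with a self-loop on every state. The sketch below is only meant to show how a learned signature can accept sequences absent from the training cluster; it is not the learning algorithm used by Nemean, and the function names are hypothetical.

```python
from itertools import groupby


def collapse_runs(seq):
    """Replace each run of identical symbols with a single symbol."""
    return tuple(symbol for symbol, _ in groupby(seq))


def learn_signature(cluster):
    """Learn the set of run-collapsed request sequences seen in a cluster."""
    return {collapse_runs(seq) for seq in cluster}


def matches(signature, seq):
    """A sequence matches if its run-collapsed form was learned."""
    return collapse_runs(seq) in signature


training = [
    ["GET", "GET", "SEARCH"],
    ["GET", "SEARCH"],
]
signature = learn_signature(training)
unseen = ["GET"] * 5 + ["SEARCH"]       # never observed, differs only in repetition
```

Here the unseen sequence is accepted because it differs from the training data only in how often a step repeats, while a reordered sequence such as `["SEARCH", "GET"]` is rejected.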
    4 Data Abstraction Component Implementation
    We have implemented prototypes of each Nemean component. While the Nemean design provides flexi-
    bility to handle any protocol, we focus our discussion on two specific protocol implementations, HTTP (port
    80) and NetBIOS/SMB (ports 139 and 445), since these two services exhibit great diversity in the number and
    types of exploits.
    4.1 Transport-Level Normalization
    Transport-level normalization resolves ambiguities introduced at the network (IP) and transport (TCP) layers
    of the protocol stack. We check message integrity, reorder packets as needed, and discard invalid or duplicate
    packets. The importance of transport layer normalizers has been addressed in the literature. Building a
    normalizer that perfectly resolves all ambiguities is a complicated endeavor, especially since many ambiguities
    are operating system dependent. We can constrain the set of normalization functions for two reasons. First,
    we only consider traffic sent to honeynets, so we have perfect knowledge of the host environment. This
    environment remains relatively constant. We do not need to worry about ambiguities introduced due to DHCP
    or network address translation (NAT). Second, Nemean’s current implementation analyzes network traces
    off-line, which relaxes its state-holding requirements and makes it less vulnerable to resource-consumption
    attacks.
    Attacks that attempt to evade a NIDS by introducing ambiguities to IP packets are well known. Examples
    of such attacks include simple insertion attacks, in which packets that would be dropped by real systems are
    evaluated by the NIDS, and evasion attacks, which are the reverse. Since Nemean obtains traffic promiscuously
    via a packet sniffer (just like a real NIDS), these ambiguities must be resolved. We focus on three common
    techniques used by attackers to elude detection.
    First, an invalid field in a protocol header may cause a NIDS to handle the packet differently than the des-
    tination machine. Handling invalid protocol fields in IP packets involves two steps: recognizing the presence
    of the invalid fields and understanding how a particular operating system would handle them. Our imple-
    mentation performs some of these validations. For example, we drop packets with an invalid IP checksum or
    length field.
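The checksum and length validation mentioned above can be sketched as follows. The function name is hypothetical and the sketch covers only the IPv4 header; it uses the standard ones'-complement Internet checksum, under which a valid header (checksum field included) sums to 0xFFFF.

```python
import struct


def ipv4_header_ok(header: bytes) -> bool:
    """Return True if the header length field and checksum are both consistent."""
    if len(header) < 20:
        return False
    ihl_bytes = (header[0] & 0x0F) * 4     # header length field, in bytes
    if ihl_bytes != len(header):
        return False
    # Sum the header as big-endian 16-bit words, folding carries back in.
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total == 0xFFFF


# A 20-byte header with a correct checksum (0xb861).
good = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")
# The same header with the TTL byte altered, which invalidates the checksum.
bad = bytes.fromhex("45000073000040004111b861c0a80001c0a800c7")
```

A normalizer following the text would drop `bad` before it reaches flow aggregation.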
    Second, an attacker can use IP fragmentation to present different data to the NIDS than to the destination.
    Fragmentation introduces two problems: correctly reordering shuffled packets and resolving overlapping
    segments. Various operating systems address these problems in different ways. We adopt the
    always-favor-old-data method used by Microsoft Windows. A live deployment must either periodically perform
    active-mapping or match rules with passive operating system fingerprinting. The same logic applies for
    fragmented or overlapping TCP segments.
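A favor-old-data policy can be sketched as below: when two fragments cover the same byte offset, the byte that arrived first is kept. The `(offset, data)` representation with byte offsets and the function name are assumptions for the example; real IP fragments carry offsets in 8-byte units plus a more-fragments flag.

```python
def reassemble_favor_old(fragments):
    """Reassemble a payload from (offset, data) fragments given in arrival order.

    On overlap, the earlier-arriving byte wins. Raises ValueError if a gap remains.
    """
    buffer = {}
    for offset, data in fragments:
        for i, byte in enumerate(data):
            # setdefault keeps the existing (older) byte if one is present.
            buffer.setdefault(offset + i, byte)
    if not buffer:
        return b""
    end = max(buffer) + 1
    if len(buffer) != end:
        raise ValueError("missing fragment data")
    return bytes(buffer[i] for i in range(end))


# The second fragment overlaps bytes 4..7 with different content.
overlapping = [(0, b"GET /ind"), (4, b"XXXXex.html")]
payload = reassemble_favor_old(overlapping)

# Fragments may also arrive out of order.
shuffled = [(8, b"ex.html"), (0, b"GET /ind")]
```

A NIDS that instead favored new data would reconstruct a different payload from `overlapping` than the destination host does, which is the ambiguity the text describes.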
    Third, incorrect understanding of the TCP Control Block (TCB) tear-down timer can cause a NIDS to
    improperly maintain state. If it closes a connection too early it will lose state. Likewise, retaining connections
    too long can prevent detection of legitimate later connections. Our implementation maintains connection state
    for an hour after a session has been closed. However, sessions that have been closed or reset are replaced earlier