In this case, the subpattern of the merged state transitions for one target pattern is not always a subpattern of another target pattern. Therefore, in order to share the state transition towards a shared state s_i, the state transition from state s_i cannot be merged with the state transition towards state s_i. If the state transition from an output state is performed, the output state has been reached.
On the other hand, for merged state transitions, a sequence of characters is compared with a subpattern at once. When the state transition from an output state is merged with a state transition towards that output state, there is no way to know whether the output state has been reached. Therefore, the state transition from an output state cannot be merged with the state transition towards the output state. If a state transition st_j is merged with other state transitions, st_j becomes an element of a set whose state transitions are compared with a sequence of characters at once.
If a state transition is merged more than once, it becomes an element of more than one set, and the state transition can then be performed several times. Therefore, each state transition must be merged only once. Fig 3 describes a procedure that extracts the sets of state transitions to be merged.
In stage stage_i, a state transition st_k of state state_j is stored in a temporary variable st. The next state transition is obtained from st by calling procedure NextStateTransitionfrom, and variable st is then replaced with that next state transition. This process is iterated until the maximum number of state transitions has been merged.
Note that set Y may contain as few as one state transition. After the while loop finishes, the final set Y becomes a new element of Z. Finally, Z, which holds all sets of state transitions to be merged, is returned.
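The extraction step described above can be illustrated with a minimal Python sketch. It is not the procedure of Fig 3 itself: the trie representation, the names build_trie and extract_merged_sets, the breadth-first visiting order, and the parameter max_merge (standing in for the maximum number of merged state transitions) are assumptions made for illustration. Each set Y starts from one state transition and is extended only while the state just reached is not an output state and has exactly one valid state transition, so every state transition ends up in exactly one set.

```python
from collections import deque


def build_trie(patterns):
    """Build a goto-style trie (no failure links) from the target patterns."""
    children = [{}]   # children[state][char] -> next state; state 0 is the root
    outputs = set()   # states at which a pattern ends
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in children[state]:
                children.append({})
                children[state][ch] = len(children) - 1
            state = children[state][ch]
        outputs.add(state)
    return children, outputs


def extract_merged_sets(children, outputs, max_merge):
    """Return Z, a list of sets Y; each Y is a chain of (src, char, dst) transitions."""
    Z = []
    queue = deque([0])                 # states from which a new chain may start
    while queue:
        start = queue.popleft()
        for ch, nxt in sorted(children[start].items()):
            Y = [(start, ch, nxt)]     # a set may hold a single transition
            cur = nxt
            # Extend only through non-output states with one valid transition.
            while (len(Y) < max_merge
                   and cur not in outputs
                   and len(children[cur]) == 1):
                (next_ch, next_state), = children[cur].items()
                Y.append((cur, next_ch, next_state))
                cur = next_state
            Z.append(Y)
            queue.append(cur)          # the chain end starts the next chains
    return Z
```

For the hypothetical patterns "middle" and "mint" with max_merge set to 3, the sketch yields the subpatterns mi, ddl, nt, and e: the chain stops at the state shared by both patterns, at the merge limit, and at each output state.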
Fig 4 shows an example of the pipelined NFA using merged state transitions. In Fig 4, subpatterns dle, no, and rt are adopted to merge state transitions. Compared to Fig 1(a), the state transition from state s_0 is merged with the state transition from state s_1 because state s_1 has only one valid state transition, towards state s_2.
The state transition from state s_9 is not merged because state s_9 is an output state. Fig 5 shows a circuit diagram of the implementation of the pipelined NFA using the merged state transitions in Fig 4. To evaluate merged state transitions at once, the input character is decoded, and the decoded output bits are then shifted. In the left part of Fig 5, registers are used to shift the decoded output bits. The outputs of the registers are fed into the comparators of the merged state transitions.
For example, if three state transitions are merged, three decoded output bits are fed into the comparator so that three input characters are compared at once. The comparators for the merged state transitions are implemented using LUTs. Compared to the circuit diagram in Fig 2, multiple decoded bits are fed into several LUTs in Fig 5, and the required number of LUTs is reduced from nine to six. When state transitions are merged, the state information can be shifted using a chain of FFs.
Because the state information is stored in FFs each time, the number of FFs for storing the state information does not change compared to Fig 2. On the other hand, several registers are required for shifting the output bits of the character decoder. In practice, because not all ASCII characters appear in the target patterns, the number of additional FFs can be smaller than the maximum. Merging state transitions therefore reduces the number of used LUTs while increasing the number of used FFs. Overall, merged state transitions greatly reduce the number of used LUTs at the cost of a slightly increased number of used FFs.
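The timing of a single merged state transition can be sketched behaviorally in Python. This is an illustrative model under stated assumptions, not the circuit of Fig 5: characters are compared directly instead of as one-hot decoded bits, the function name simulate_merged_transition and the list source_active are hypothetical, and the subpattern is assumed to contain at least one character. The two deques stand in for the registers that shift the decoded characters and for the FF chain that shifts the state information.

```python
from collections import deque


def simulate_merged_transition(text, subpattern, source_active):
    """Return, for each input character, whether the merged transition fires.

    source_active[t] is True when the source state is active just before
    character text[t] is consumed. The subpattern must be non-empty.
    """
    k = len(subpattern)
    # Registers holding the k-1 previously decoded characters.
    char_regs = deque([None] * (k - 1), maxlen=k - 1)
    # FF chain delaying the source-state bit by k-1 cycles.
    state_regs = deque([False] * (k - 1), maxlen=k - 1)
    fired = []
    for t, ch in enumerate(text):
        window = list(char_regs) + [ch]           # oldest ... newest character
        src = state_regs[0] if k > 1 else source_active[t]
        # Comparator: delayed state bit ANDed with the k-character comparison.
        fired.append(bool(src) and window == list(subpattern))
        char_regs.append(ch)
        state_regs.append(source_active[t])
    return fired
```

With an always-active source state, the subpattern no, and the input xnono, the transition fires on the third and fifth characters, that is, exactly when the last two characters form the subpattern and the source state was active before the first of them.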
As shown in the previous section, the identification number of the longest matched pattern is provided by the priority encoder. To decrease hardware complexity, a pipelined priority encoder can be adopted so that a high operating frequency is achieved. Alternatively, a priority encoder specific to the target patterns can be implemented. For example, when several patterns have the same length, the output states of those patterns are located in the same stage. Because only one state in each stage can be the current state, the hardware complexity of the priority encoder can be reduced.
In this case, the priority encoder provides the index of the stage that contains the longest matched pattern. In addition, an encoder is required for each stage to provide the index of the output state of the matched pattern. Therefore, the identification number of the pattern for a state can be formed by combining the index of the state within its stage with the index of the stage. In this case, the size of the identification number depends on the distribution of pattern lengths.
In our implementation, instead of designing a specific pipelined priority encoder for each set of target patterns, a hierarchical design of the pipelined priority encoder is adopted as follows. First, a unit block of the pipelined priority encoder is shown in Fig 6(a), where bits s_0, s_1, s_2, and s_3 carry the information of four output states.
Multiple unit blocks are used in the first pipeline stage. If an output state s_i is reached, signal s_i in Fig 6(a) is true; otherwise, signal s_i is false. The priority encoder in Fig 6(a) is a combinational logic circuit with four input bits. A four-input OR gate generates output signal matched from signals s_0, s_1, s_2, and s_3. Signal matched indicates whether there is any matched pattern. There are two reasons why four bits are used as inputs. First, because each LUT has one output bit, four input bits can be encoded into two output bits with two LUTs.
Second, considering the number of inputs of an LUT in commercial FPGAs, four input bits are suitable for generating signal matched. In every stage except the first, multiple-bit indexes from the previous stage are the inputs. Fig 6(b) shows a block for a stage that receives four n-bit indexes and matched_i signals from the previous stage. Output signal matched indicates whether there is any matched pattern by ORing the matched signals from the previous stage. The 4-to-2 priority encoder provides a two-bit selection signal for the n-bit multiplexer, which forwards the matching index with the highest priority from the previous stage.
The matching index of this stage is formed by concatenating the two bits from the 4-to-2 priority encoder (high-order bits) and the n bits from the multiplexer (low-order bits). Based on the explanation of Fig 6 above, Fig 7 shows a diagram of the pipelined priority encoder. Although Fig 7 handles the information of 32 output states, a hierarchical and regular design is possible. Therefore, the 5 leftmost bits can be output to identify matches among the 32 patterns in Fig 7.
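The hierarchical index construction described above can be sketched in Python as follows. The sketch makes several assumptions that are not stated in the text: the highest-numbered input has the highest priority, groups with fewer than four inputs are padded with unmatched entries, and the pipeline registers between stages are not modeled, so only the combinational function of each stage is shown. The names unit_block, stage, and priority_encode are hypothetical.

```python
def unit_block(bits):
    """4-to-2 priority encoder: return (matched, two-bit index).

    Assumption: the highest-numbered true input wins.
    """
    for i in (3, 2, 1, 0):
        if bits[i]:
            return True, i
    return False, 0


def stage(entries, width):
    """Combine groups of four (matched, index) pairs from the previous stage.

    width is the number of index bits produced by the previous stages.
    """
    out = []
    for g in range(0, len(entries), 4):
        group = entries[g:g + 4]
        group += [(False, 0)] * (4 - len(group))   # pad an incomplete group
        matched, sel = unit_block([m for m, _ in group])
        # Two selection bits become the high-order bits; the index chosen
        # by the multiplexer supplies the low-order bits.
        out.append((matched, (sel << width) | group[sel][1]))
    return out


def priority_encode(states):
    """Return (matched, index) for a list of output-state bits."""
    entries = [(bool(s), 0) for s in states]
    width = 0
    while len(entries) > 1:
        entries = stage(entries, width)
        width += 2
    return entries[0]
```

For 32 output-state bits the sketch uses three stages of 8, 2, and 1 blocks, and the resulting index is always below 32, so it fits in five bits.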
Several issues arise when constructing the NFA and implementing the hardware; they are discussed in the following subsections. Fig 8 shows the flow for obtaining FPGA configuration data from a rule set. To construct a pipelined NFA, patterns are extracted from the rule set. Each character in a pattern is mapped onto a valid state transition, which is similar to the goto function of the Aho-Corasick algorithm.
On the other hand, the pipelined NFA does not adopt the failure function. In the synthesis and implementation step, the resource usage is reported. When a target rule set is updated, the generated HDL code must be changed, and the new code must be synthesized and implemented. Because the configuration time of an FPGA can be long, suitable solutions are required. If a redundant FPGA is used, there is no need to stop the string matching engine when updating the rule set. Once the redundant device has been configured, it can start running.
At the same time, the old FPGA becomes the redundant device. In addition, partial reconfiguration can help with updated rule sets. Major FPGA vendors now support partial reconfiguration, as shown in [30, 31]. By adopting the partial reconfiguration flow, the functionality for the changed patterns can be updated on the fly. Considering these solutions for updated rule sets, FPGA-based string matching appears to be realistic on commercial FPGAs.
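The construction step described above, in which each pattern character becomes one state transition and no failure function is used, can be modeled in software with a short Python sketch. It is an illustration only and not the HDL generation flow of Fig 8: the names build_goto and match are hypothetical, the set of active states stands in for the state FFs of the hardware, and the longest matched pattern is chosen with max rather than with a priority encoder.

```python
def build_goto(patterns):
    """Build a goto-style trie: one state transition per pattern character."""
    goto = [{}]    # goto[state][char] -> next state; state 0 is the root
    output = {}    # output state -> index of the pattern ending there
    for pid, pattern in enumerate(patterns):
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state] = pid
    return goto, output


def match(text, patterns):
    """Return (end position, index of the longest pattern ending there)."""
    goto, output = build_goto(patterns)
    active = set()    # states reached by the characters consumed so far
    hits = []
    for pos, ch in enumerate(text):
        # The root is always a candidate, so a match can start at any
        # position; this takes the place of the failure function.
        candidates = active | {0}
        active = {goto[s][ch] for s in candidates if ch in goto[s]}
        matched = [output[s] for s in active if s in output]
        if matched:
            longest = max(matched, key=lambda pid: len(patterns[pid]))
            hits.append((pos, longest))
    return hits
```

For the hypothetical patterns he, she, and hers and the input ushers, the sketch reports she at the fourth character (he also ends there but is shorter) and hers at the sixth.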
The program ran on a Linux machine with CentOS 5. Eight different rule sets were adopted from Snort [33] v2. Table 1 lists several characteristics of the rule sets. The distributions of pattern lengths differed among the rule sets. Therefore, the adopted rule sets were considered sufficient for evaluating the proposed string matching scheme.
To measure both the hardware overhead and the maximum operating frequency as functions of the maximum number of merged state transitions for a state, denoted M, evaluations were performed by sweeping M. The number of FFs for shifting the decoded output bits of character inputs increased with M in all rule sets.
In rule sets with a small number of target patterns, the ratio of FFs for shifting the decoded output bits to FFs for implementing states in the pipelined NFA can be large. Therefore, for rule sets with a small number of target patterns, such as chat, exploit, and policy, the ratios increased sharply with M, as shown in Fig 9(a). Because a pipelined NFA is structured like a tree, the ratio of state transitions to be merged can be large in rule sets with a long average pattern length, such as oracle and web-client. In this implementation, five decoded output bits of character inputs for merged state transitions and the value of a state can be ANDed in one LUT.
In web-client, because of the repeated subpatterns with 0…0 in each pattern, the combinational logic of the comparators for merged state transitions was found to be optimized for low hardware overhead. Therefore, it was concluded that the maximum number of inputs of an LUT is related to the hardware overhead when merging state transitions. In addition, the amount of reduced hardware overhead was expected to depend on the set of target patterns.
After analyzing the synthesis report for web-client, a large routing delay was found in the circuit that transmits the shifted decoded output bits to the comparators for merged state transitions. Because of the long patterns in web-client, it was concluded that the routing complexity for large M can be high. Considering the analysis above, there was a threshold point of M, which was related to the structure of an LUT.
Table 2 shows evaluation data for the proposed string matching scheme. Many characters were in shared common prefixes of each rule set, where the ratios of characters in shared common prefixes to all characters for a rule set ranged from In order to compare with the previous pipelined NFA, the number of moved state transitions to be merged was computed. As shown in Table 2 , Therefore, it was expected that there were many state transitions to be merged for reducing hardware overhead.
In addition, the times required for generating the HDL code and for synthesizing it were estimated for each rule set. Assuming that the number of target patterns is N, the time complexity of constructing an NFA and extracting merged state transitions is O(N). Considering the data in Table 2 and this time complexity, the time required for generating the HDL code was not expected to be large for any rule set. The time required for synthesis was also not large for the adopted rule sets. The maximum operating frequency F of each string matching scheme is shown as well.
In particular, in web-client, because of the repeated subpatterns mentioned above, the required number of LUTs was expected to be optimized by the synthesis tool. Therefore, web-client was the only rule set for which the required number of LUTs was not decreased compared to cam. Except for this case, proposed required the smallest numbers of LUTs among the schemes. In particular, compared to the pipelined NFA string matching scheme in [19], the required numbers of LUTs were decreased by On the other hand, it was noted that the required numbers of FFs increased because of shifting the decoded output bits of character inputs and implementing the pipelined priority encoder.
In addition, considering the characteristics in Table 1 and the experimental data in Table 3, it was concluded that the ratio of increased FFs was negligible when the numbers of states and state transitions were large. From this comparison and the data in Fig 9(a), most of the additional FFs for large rule sets such as spyware and web-client were attributed to the implementation of the pipelined priority encoder.
These three schemes with high F values adopted the same pipelined priority encoder for each rule set. Even though cam, dfa, and ppfac required small numbers of FFs, their F values were not high.
In addition, the proposed pipelined NFA-based string matching scheme provided high F values with small deviation, where the throughput with one character input at a time reached up to 5. To assess the design density and distribution, data on the logic distribution are added. If logic clusters contain insufficient logic resources, a large amount of inter-cluster routing resources is needed. Therefore, the complexity of wiring and interconnections can be predicted.
The main reason for the low ratios was that the number of FFs was much smaller than the number of LUTs. Considering the low ratios of proposed, additional routing resources between clusters may be required in proposed. Because of the greatly decreased number of used LUTs, the ratios can be lowered. However, for large rule sets such as spyware and web-client, because of the low ratio of increased FFs, the additional routing resources between clusters were expected to be small.
This paper proposes a pipelined NFA-based string matching scheme with a new technique called merged state transitions. In addition, a pipelined priority encoder is adopted in order to maximize the operating frequency. The proposed string matching scheme is evaluated in a realistic experimental environment using automatically generated RTL code, a commercial synthesis tool, and a state-of-the-art FPGA. Experimental data show that the proposed string matching scheme can greatly reduce the number of LUTs and achieve high throughput per one character input up to As shown in [28, 29], because the pipelined NFA-based string matching scheme can process multiple chunks of characters in parallel, it is expected that throughput can be enhanced by equipping multiple instances.
Therefore, the proposed string matching scheme can be extended. Considering the conceptual idea and the experimental data, it is concluded that the proposed pipelined string matching scheme helps achieve high performance at low hardware cost. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
PLoS One. Published online Oct 3. Yongtang Shi, Editor. Competing Interests: The authors have declared that no competing interests exist. Conceptualization: HJK. Data curation: HJK. Funding acquisition: HJK. Methodology: HJK. Project administration: HJK. Software: HJK.