Paper Title
High Frequency Trading System-on-Chip (HFT-SoC) with FPGA/GPU Fabrics

Abstract
This paper discusses programmable System-on-Chip (SoC) design for implementing a High Frequency Trading (HFT) system on a single chip using FPGA and GPU fabrics. The single-chip approach can be of particular value to HFT system developers facing stringent time-to-market, cost, performance, design-reuse, and longevity requirements. In the proposed SoC design, embedded processors, FPGA fabric, GPUs, and CPUs run collaboratively for High Frequency Trading. The computational load of trading decisions is divided over multiple FPGA processing nodes operating in parallel, which reduces the load on any single FPGA node. However, this requires a dedicated Trade Processing Unit (TPU), a controller that can distribute Market Data to and from the nodes connected with the SoCs at wire speed. This paper concentrates on the system topologies and on the implementation aspects of the TPU and the allied trade decision-making sub-modules, namely Complex Event Processing (CEP), the Trading Engine (TE), the Derivatives Analytics Engine (DAE), the Pre-Trade and Post-Trade Risk Engine (RE), and the Order Processing Engine (OPE), all on FPGAs and GPUs within a single SoC, chosen for its integrated-circuit-like performance and highly granular programmability. The performance of this HFT SoC has the potential to improve by a factor of 100 to 1000 compared to traditional systems based on multi-core processors, which depend on OS scheduling. This single-chip HFT design, which does not use any OS, shows that the specialization of FPGAs, the parallelism of GPUs, and the scalability of processors, memory controllers, and peripherals can be combined with customizable FPGA fabric in a single SoC. The TPU’s hardware implementation designed here is a modular, highly parameterizable design for 10-gigabit Ethernet. The design can be verified by simulation and synthesizable test benches.
The design can also be synthesized on different FPGA devices while varying parameters to analyze the achieved performance. High-end FPGA devices, such as the Altera Stratix family, can meet the target processing speed of 10-gigabit Ethernet, and the latency of the TPU can be comparable to that of a typical switch. An advanced wire-speed CEP Engine design is proposed for processing Market Data to identify complex events and patterns and to match them with the right strategies using semantics: given a market event and a set of financial strategies, it finds all strategies satisfied by the identified market event. The computationally intensive DAE module is implemented to take full advantage of the FPGA architecture: financial computing applications suited to parallelism, such as option pricing, Value at Risk, and Credit Risk analysis, are mapped closely to hardware. All of the DAE applications use computationally expensive Monte-Carlo simulations with different workloads. Because each simulation relies on the average result of thousands of independent stochastic paths, massive parallelism can be adopted to accelerate the computation with FPGAs. A Monte-Carlo Simulation Engine (MCE) is designed exclusively on a separate FPGA to cater to the computational needs of the DAE. All other trading sub-modules can be implemented on FPGAs somewhat more easily than the DAE module, and sub-microsecond latency can be achieved.

Keywords - Programmable Syste, Carlo Simulation Engine
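
As an illustration of why the Monte-Carlo workload described for the DAE parallelizes well, the following is a minimal software sketch, not the hardware design itself. It assumes a European call option under geometric Brownian motion, and every name and parameter value in it (simulate_chunk, mc_call_price, black_scholes_call, the eight workers) is hypothetical and chosen only for illustration. Each worker sums payoffs over its own independent paths with its own random stream, and the price is the average of the combined partial sums. The workers run one after another here; on an FPGA each could plausibly map to a separate pipeline.

```python
import math
import random


def simulate_chunk(seed, n_paths, s0, strike, rate, vol, maturity):
    """Return the sum of discounted call payoffs over n_paths independent paths."""
    rng = random.Random(seed)  # each worker owns an independent random stream
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    discount = math.exp(-rate * maturity)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp(drift + diffusion * z)  # terminal price under GBM (assumed model)
        total += discount * max(s_t - strike, 0.0)
    return total


def mc_call_price(n_workers=8, paths_per_worker=25_000, s0=100.0,
                  strike=100.0, rate=0.05, vol=0.2, maturity=1.0):
    """Average the payoffs over all paths; workers share no state."""
    partial_sums = [
        simulate_chunk(seed, paths_per_worker, s0, strike, rate, vol, maturity)
        for seed in range(n_workers)  # sequential stand-in for parallel hardware lanes
    ]
    return sum(partial_sums) / (n_workers * paths_per_worker)


def black_scholes_call(s0, strike, rate, vol, maturity):
    """Closed-form reference price, used only to sanity-check the estimate."""
    def norm_cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    d1 = (math.log(s0 / strike) + (rate + 0.5 * vol * vol) * maturity) / (vol * math.sqrt(maturity))
    d2 = d1 - vol * math.sqrt(maturity)
    return s0 * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


if __name__ == "__main__":
    print("Monte-Carlo estimate:", mc_call_price())
    print("Closed-form reference:", black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0))
```

Because the partial sums are combined only in the final reduction, adding workers increases the number of paths processed per unit time without any communication between them, which is the property the abstract relies on.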