An Approximate Parallel Multiplier with Deterministic Errors for Ultra-High Speed Integrated Optical Circuits Jun Shiomi 1, Tohru Ishihara 1, Hidetoshi Onodera 1, Akihiko Shinya 2, Masaya Notomi 2 1 Graduate School of Informatics, Kyoto University, Japan 2 NTT Nanophotonics Center / NTT Basic Research Laboratories, Japan 1
Beyond Optical Communication Background: Advancement in nanophotonic devices Enables on-chip interconnects for broadband commutation CPU Core Optical Network-on-Chip Network interface CPU CPU CPU Inter-chip optical network To higher hierarchy (e.g., server-to-server) Goal: Add functional unit to boost up on-chip communication E.g. pattern recognition 2
Approximate Parallel Multiplier with Nanophotonic Devices CPU Core Optical Network-on-Chip Network interface CPU CPU CPU Inter-chip optical network To higher hierarchy (e.g., server-to-server) Optical implementation of approximate parallel multiplier - For Machine Learning (ML), Approximate Computing (AC) - 11% deterministic error at the worst case Enables signal processing with ultra-low latency - Example (32-bit parallel multiplier) W/o approximation: > 800 ps (< 1.25 GHz) This work: 123 ps ( 8.1 GHz) 6.5 boost 3
Outline Background Parallel Multiplier Using Nanophotonic Devices Approximate Parallel Multiplier Overview Detailed Implementation Performance Evaluation Conclusion 4
Photonic Crystal-Based Optical Pass-Gate (OPG) [2] Directional coupler Length > 100μm 0: cross Electric signal 1: pass Light Light Modulator Length 1.3 μm < 1 ps Electric signal 1: pass 0: block < 1 ps Electric signal 0: pass 1: block n-inp p-inp < 1 ps < 1 ps [2] T. Ishihara, et al., International SoC Design Conference, 2016 5
OptoElectoric Conversion Delay Serial connection Cascade connection Electric signal Optical signal OptoElectric () Light speed ( 1 ps /gate) τ OPG conversion ( 25 ps) τ OE Reducing s on a critical path is a key to ultra-fast operation 6
Issues in Conventional Optical Parallel Multiplier 5-bit array multiplier Partial Product PP2 PP1 PP0 Optical Full Adder (FA) [2] PP3 FA X Y CI π not CO S PP4 FA FA CLA [2] T. Ishihara, et al., International SoC Design Conference, 2016 sum Issue: #(s) = (Bit width n) Large latency for large n 1τ OE = 25 ps 32τ OE = 800 ps 64τ OE = 1600 ps 40 GHz 1.25 GHz 625 MHz Too slow! 7
Outline Background Parallel Multiplier Using Nanophotonic Devices Approximate Parallel Multiplier Overview Detailed Implementation Performance Evaluation Conclusion 8
Basic Idea of the Proposed Approximate Multiplier X n Y n = 2 log 2 X n +log 2 Y n X n, Y n Z + n-bit integer Fixed point value Optical signal Electric signal X n Y n Approximate logarithm Approximate logarithm Add [2] 2n-bit integer Approximate antilogarithm Result (X n Y n ) Serial connections only! Light speed signal processing [2] T. Ishihara, et al., International SoC Design Conference, 2016 Three converters on a critical path for any bit width n 9
Concept of Approximate Logarithm [3] 2 log 2 X n +log 2 Y n 4 3 W/o approximation E.g. log 2 7 approximation MSB 0 1 1 1 3 2 1 0 LSB log 2 X n 2 1. Find max. digit that has 1 10 2 (= Priority encoder) 1 0 W/ linear approximation (This work) 1 2 4 7 8 X n 2. Simply return the remainder 11 2 (= Shifter) Approximate log 2 7 result: 10.110 2 (Fixed point) Approximate antilogarithm: Inverse function of the logarithm [3] J. N. Mitchell, IRE Transactions on Electronic Computers, 1962 10
Priority Priority Encoder-Based Approximate Logarithm (n = 4) 2 log 2 X n +log 2 Y n Integer part of log 2 X n Fractional part of log 2 X n High MSB 1 0 2 1 0 LSB 0: cross 1: pass x 3 x 2 0: block 1: pass x 1 Low x 0 0: pass 1: block Light source Light source Priority encoder Shifter Only one converter on a critical path for any bit width n 11
Integer part s Barrel Shifter-Based Approximate Antilogarithm (n = 4) 2 log 2 X n +log 2 Y n Carry from adder of fractional part MSB Multiplication results 7 6 5 4 3 2 1 0 LSB LSB 0: cross 1: pass 0 1 2 Barrel Shifter 0: block 1: pass MSB Light source MSB 2 1 0 LSB log 2 X n + log 2 Y n (Fixed point & Optical signal) s Fractional part One converter on a critical path 12
Outline Background Parallel Multiplier Using Nanophotonic Devices Approximate Parallel Multiplier Overview Detailed Implementation Performance Evaluation Conclusion 13
Performance Analysis Critical path Serial connections only! X n Approximate logarithm Add [2] Approximate antilogarithm Result (X n Y n ) Latency = 3τ OE + O nτ OPG 25 ps 1 ps τ OE = 25 ps Error: Deterministic [3] 11% (when X n = 3 2 i and Y n = 3 2 j ) 0% (when X n = 2 i or Y n = 2 j ) i, j Z + τ OPG = 1 ps τ OPG = 1 ps [2] T. Ishihara, et al., International SoC Design Conference, 2016 [3] J. N. Mitchell, IRE Transactions on Electronic Computers, 1962 14
Performance Comparison Latency Error Conventional (array multiplier) > nτ OE 0 Proposed 3τ OE + O nτ OPG 11%, 0% Conventional: > 800 ps (< 1.25 GHz) n = 32 Proposed: 123 ps ( 8.1 GHz) 6.5 boost More than 6. 5 faster w/ deterministic errors Application: Machine learning, approximate computing 15
Multiplication result (optical power) Verification Using Optoelectronic Circuit Simulator (n = 4) MSB 7-bit 6-bit 5-bit 4-bit 3-bit 2-bit 1-bit 0-bit LSB 0 s X= 0 0 0 1 Y= 0 0 0 1 XY= 0 0 0 0 0 0 0 1 X= 1 1 1 1 Y= 0 0 0 1 XY= 0 0 0 0 1 1 1 1 X= 1 1 1 1 Y= 1 1 1 1 XY= 1 1 1 0 0 0 0 0 1 ns 16
Conclusion and Future Work Approximate parallel multiplier using optical path-gates - Ultra-fast: 8.1 GHz More than 6.5 faster operation - Deterministic error: From 11% to 0% - Correct operation confirmed by optoelectronic circuit simulator Future work - Power & area evaluation - Comparison with CMOS-based parallel multiplier 17