Enabling New Computation Paradigms with HyperFET - An Emerging Device

Wei-Yu Tsai, Xueqing Li, Member, IEEE, Matthew Jerry, Baihua Xie, Nikhil Shukla, Huichu Liu, Member, IEEE, Nandhini Chandramoorthy, Matthew Cotter, Member, IEEE, Arijit Raychowdhury, Senior Member, IEEE, Donald M. Chiariulli, Member, IEEE, Steven P. Levitan, Fellow, IEEE, Suman Datta, Fellow, IEEE, John Sampson, Member, IEEE, Nagarajan Ranganathan, Fellow, IEEE, and Vijaykrishnan Narayanan, Fellow, IEEE

Abstract—High power consumption has significantly increased the cooling cost in high-performance computation stations and limited the operation time in portable systems powered by batteries. Traditional power reduction mechanisms have limited traction in the post-Dennard Scaling landscape. Emerging research on new computation devices and associated architectures has shown three trends with the potential to greatly mitigate current power limitations. The first is to employ steep-slope transistors to enable fundamentally more efficient operation at reduced supply voltage in conventional Boolean logic, reducing dynamic power. The second is to employ brain-inspired computation paradigms, directly embodying computation mechanisms inspired by the brain, which have shown potential in extremely efficient, if approximate, processing with silicon-neuron networks. The third is "let physics do the computation", which focuses on using the intrinsic operation mechanism of devices (such as coupled oscillators) to do the approximate computation, instead of building complex circuits to carry out the same function. This paper first describes these three trends, and then proposes the use of the hybrid-phase-transition-FET (Hyper-FET), a device that could be configured as a steep-slope transistor, a spiking neuron cell, or an oscillator, as the device of choice for carrying these three trends forward. We discuss how a single class of device can be configured for these multiple use cases, and provide in-depth examination and analysis for a case study of building coupled-oscillator systems using Hyper-FETs for image processing. Performance benchmarking highlights the potential of significantly higher energy efficiency than dedicated CMOS accelerators at the same technology node.

Index Terms—HyperFET, steep slope, coupled oscillators, neural network, spiking neuron, image processing, approximated processing

1 INTRODUCTION

For the last few decades, power has been a major constraint for very-large integrated circuits. In the past,

- W.Y. Tsai, X. Li, N. Chandramoorthy, J. Sampson, and V. Narayanan are with the Department of Computer Science and Engineering, 354 IST Building, the Pennsylvania State University, University Park, PA 16802. E-mail: [wyt114, lxueq, nic5090, sampson, vijay]@cse.psu.edu.
- M. Cotter is with the Applied Research Laboratory, P.O. Box 30, State College, PA 16804-0030. E-mail: mj324@ece.psu.edu.
- M. Jerry, B. Xie, and N. Shukla are with the Department of Electrical Engineering, the Pennsylvania State University, University Park, PA 16802. E-mail: [mjj182, baihua.xie, nss152]@psu.edu.
- H. Liu is with the Intel Corporation, Santa Clara, CA 95054. E-mail: hlx249@psu.edu.
- S. Datta is with the Department of Electrical Engineering, University of Notre Dame, 271 Fitzpatrick Hall, Notre Dame, IN 46556. E-mail: sdatta@nd.edu.
- A. Raychowdhury is with the Georgia Institute of Technology, KLAUS 2362, 266 Ferst Drive, Atlanta, GA 30332. E-mail: arijit.raychowdhury@ece.gatech.edu.
- D.M. Chiariulli is with the Department of Computer Science, University of Pittsburgh, PA 15260. E-mail: don@cs.pitt.edu.
- S.P. Levitan is with the Departments of Electrical and Computer Engineering and Computer Science, University of Pittsburgh, 3700 O'Hara St., Pittsburgh, PA 15261. E-mail: levitan@pitt.edu.
- N. Ranganathan is with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620. E-mail: ranganat@mail.usf.edu.

increases in chip functionality were paid for by lowering the supply voltage and reducing transistor capacitance through the scaling of CMOS technologies. However, with the end of Dennard scaling [4], further reduction of the supply voltage to reduce the power in Boolean logic has become challenging because of increasing leakage power with the $\geq 60 \text{ mV/decade}$ subthreshold slope (SS) of CMOS devices.

Consequences of this include high cooling cost in high-performance computation nodes, and limited operation time in portable battery-powered systems. Furthermore, the resultant shift in the economics of the virtuous cycle of investment in future process nodes holds back further reduction of cost per function. In response to these challenges, there has been rising interest in research on a collection of new devices with $< 60 \text{ mV/decade}$ SS and new architectures with higher power efficiency, as shown in Fig. 1.

The goal of steep-slope devices is to further lower power consumption by lowering the supply voltage, and hence the dynamic power, while keeping leakage current low and ON-current sufficient for driving capability [5], [6], [7], [8]. Reported research on steep-slope devices includes tunneling FETs (TFETs) [1], negative-capacitance FETs (NCFETs) [9], and metal-insulator-transition (MIT) FETs such as the Hyper-FET [2]. The most direct application scenario for these steep-slope devices is similar to that of conventional CMOS, in that they can be used as Boolean logic devices whose drain-source conduction is switched ON and OFF by the gate input. Meanwhile, it is also noted that those devices may exhibit unidirectional conduction [10], [11], hysteresis [2], [9], non-volatility

Manuscript received 26 Sept. 2015; revised 24 Nov. 2015; accepted 25 Dec. 2015. Date of publication 18 Jan. 2016; date of current version 8 Apr. 2016. Recommended for acceptance by K. Chakrabarty. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org, and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TMCS.2016.2519022
other second-order considerations such as different device capacitance characteristics [14]. While the HyperFET can be used directly in this fashion as a MOSFET replacement, this is not the primary focus of this work, and, due to similarity to existing CMOS approaches, will be covered in limited detail.

On the emerging architecture, rather than device, front, one driving question is what forms computation can and should take going forward. In particular, there has been a renaissance in domain-specific processing, especially in graphics and computer vision, increasing acceptance of specialized accelerators as part of general purpose systems, and a willingness to embrace new models. One such architecture is “brain-inspired” computation, such as that used in neuro-morphic [15] and other approximate computing platforms [16], [17], [18]. In this paper, we will show that Hyper-FET based spiking neurons, compared with conventional integrate-and-fire (IAF) neurons, perform a similar function much more efficiently and at much lower area cost.

Another attractive feature of some non-Boolean architectures is the notion that they can “let the physics do the computing” [19], [20] and, in so doing, achieve significant efficiency gains so long as the problem can be specified in a manner that matches the physical phenomenon. One such class of non-Boolean architectures for computation is sets of weakly-coupled oscillators. When a number of oscillators are coupled together, they will synchronize if their initial states are sufficiently close. Such synchronized oscillation, namely an attractor basin function, is observed across mechanical (e.g., pendulum), electrical (e.g., electronic oscillators) and human neural systems (e.g., neural-oscillators). These synchronized oscillatory systems have been shown to possess associative computational capabilities [19], [21]. In this paper, we will show that the Hyper-FET based coupled oscillators are capable of forming area-efficient and power-efficient computation primitives for a range of applications, especially in image processing. Detailed device operating mechanism, circuit and architecture design, and performance evaluation will also be provided in this paper.

To ensure that investments in these new architectures and devices yield truly efficient systems, co-design of both devices and architecture is required. In this paper, we will focus on Hyper-FET based device modeling, circuit and architecture design, showing the potential of enabling new computation paradigms for higher power efficiency. The properties of circuits designed using these new devices are well-matched to the demands of existing algorithms in image processing and other domains. Device-circuit-algorithm co-design is expected to bring even more benefits to these applications in terms of functionality and power efficiency, because it opens up more degrees of freedom for optimization.

The remainder of the paper proceeds as follows. Section 2 includes the background of the Hyper-FET devices. Section 3 describes how Hyper-FET-based spiking neurons and networks are constructed. Section 4 shows the Hyper-FET-based oscillators, and how oscillator networks’ synchronization behaviors can perform computations. Section 5 presents a case study in architecture and device co-design in the form of the implementation of a configurable oscillator network, and provides circuit-level validation that the tunable network effectively approximates a desirable family of mathematical functions. Section 6 presents the system-level approach to building a coprocessor fabric out of these tunable oscillator primitives, and how problems can be mapped to a single tile. Section 7 evaluates the computing efficiency on oscillator arrays compared to CMOS-based accelerators. Section 8 discusses related work and Section 9 concludes.

2 INTRODUCTION TO HYPER-FET DEVICES AND APPLICATIONS

The key novel feature of a HyperFET is the integration of $\text{VO}_2$, a resistive switching device (RSD), with the transistor.
$\text{VO}_2$, i.e., vanadium dioxide, is an insulator-to-metal transition (IMT) material whose resistance depends strongly on external perturbations such as temperature, pressure, and electrical stimulus [22], [23].

Fig. 2a shows the current flowing through the $\text{VO}_2$ device versus the voltage applied across it. The material has been shown to exhibit a sharp change in resistivity of up to five orders of magnitude at \(340\text{ K}\) [24].

Circuit-level simulations employ a Verilog-A model that emulates the rapid resistance transition and is based on the calibrated characteristics of the fabricated $\text{VO}_2$ oscillator. The model is as follows:

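```verilog
`include "disciplines.vams"

// Two-state hysteretic resistor emulating the VO2 phase transition
module VO2(a, b);
  inout a, b;
  electrical a, b;
  parameter real R1 = 1k;       // low-resistance state (ohm)
  parameter real R2 = 50k;      // high-resistance state (ohm)
  parameter real V1 = 5.9;      // crossing V1 selects R1 (V)
  parameter real V2 = 0.5;      // crossing V2 selects R2 (V)
  parameter real tt = 100n;     // resistance transition time (s)
  parameter real ini_type = 0;  // initial state: 0 -> R1, 1 -> R2

  real type;
  real R;

  analog begin
    @(initial_step)
    begin
      type = ini_type;
    end
    @(cross(V(a, b) - V1)) type = 0;
    @(cross(V(a, b) - V2)) type = 1;

    // Ohm's law with a smoothed switch between the two resistances
    I(a, b) <+ V(a, b)
    / transition(type ? R2 : R1, 0, tt);
  end
endmodule
```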

The device model shows good agreement with the experimental results of the relaxation oscillator and the coupled oscillator described in Section 4.1. By using a resistor [23] or a MOSFET (as a current source) [3] in series to induce negative feedback, this electrically induced phase transition in $\text{VO}_2$ can be modulated dynamically, resulting in an oscillation between high and low resistive states. Other approaches also model similar resistive switching behavior [25], [26].
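The two-threshold switching in the model can be illustrated with a minimal Python sketch. It is a simplification, not a port of the Verilog-A code: it compares the voltage against the thresholds at each sample instead of detecting crossings, and it ignores the transition time `tt`. The numeric values are the model's parameter defaults; the function names are illustrative.

```python
R1, R2 = 1e3, 50e3   # low / high resistance (ohm), defaults from the model
V1, V2 = 5.9, 0.5    # switching voltages (V), defaults from the model

def next_state(v, insulating):
    """Return True if the device is in the high-resistance state after seeing v."""
    if v >= V1:
        return False      # at or above V1: switch to (or stay in) R1
    if v <= V2:
        return True       # at or below V2: switch to (or stay in) R2
    return insulating     # between the thresholds: hold the previous state

def sweep(voltages, insulating=True):
    """Trace (voltage, resistance) pairs along a voltage sweep."""
    trace = []
    for v in voltages:
        insulating = next_state(v, insulating)
        trace.append((v, R2 if insulating else R1))
    return trace

up = [0.5 * k for k in range(15)]        # 0 V to 7 V in 0.5 V steps
down = list(reversed(up))
trace = sweep(up + down)

# The same 3 V bias gives different resistances on the way up and on the way down.
resistance_at_3v = [r for v, r in trace if v == 3.0]
print(resistance_at_3v)
```

The two entries printed for 3 V differ, which is the hysteresis that the relaxation oscillator relies on.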

Fig. 2b illustrates the schematic of the experimental Hyper-FET, which consists of a two-terminal $\text{VO}_2$ device in series with the source of a Si n-type MOSFET. In the experimental setup, the $\text{VO}_2$ device is an external device connected in series with the MOSFET. The applied gate control voltage of the MOSFET modulates the channel energy barrier and electrically triggers the abrupt state transition of the $\text{VO}_2$ material. This abrupt resistivity change modulates the drain-source current (\(I_{DS}\)) flowing through the MOSFET and induces a negative differential resistance (NDR) across the $\text{VO}_2$. The NDR results in internal voltage amplification and hence a steep-slope characteristic, which enhances Hyper-FET performance beyond that of a conventional MOSFET.

Although not shown in the concept schematic in Fig. 2b, Hyper-FETs could also make use of FinFET technology to enable multi-fin structures. Figs. 2c and 2d compare the n-type and p-type transfer characteristics (\(I_{DS} - V_{GS}\)) with those of a Si MOSFET for a fin number of 3 [2].

It is also noted that the hysteresis in Hyper-FET \(I_{DS} - V_{GS}\) curves may result in hysteretic turn-on/turn-off behavior in logic gates and, further, a more complex delay evaluation. Nevertheless, it has been shown that hysteretic logic transfer behavior can be of great benefit for noise immunity [13]. Further exploration of Hyper-FET hysteresis in digital circuits, though not covered in this paper, may reveal more applications in digital logic design.

More details on device fabrication can be found in [2]. The resistance values of the metal and insulator states are determined by the device width, length, and thickness, while the voltage conditions... determined by the device length. Significant remaining challenges include the fabrication of large device arrays with limited variability. This challenge must be confronted in two areas. At the growth level, uniform film properties must be controlled across the wafer. Equally important will be process optimization to eliminate device-to-device yield and variability challenges.

3 ANALOG SPIKING NEURON NETWORK

3.1 Spiking Neuron

Unlike Boolean logic with digital representations and clocked operations, brain-inspired systems exhibit more robustness and reliability based on distributed, event-driven, collective, and massively parallel mechanisms. Such systems make extensive use of adaptation, self-organization, and learning [27]. Efforts to bridge the gap between the scale and performance of mammalian neural networks have turned to emulating certain aspects of the form of biological nervous systems as well as their abstract functionality, with the development of dense arrays of neurons wherein certain portions of the circuit act as axons and synapses. Following the naming of the biological components, an artificial neuron is an Integrate-and-Fire (IAF) unit, receiving external excitations from the axons of the preceding neurons through the synapses, as shown in Fig. 3a. Despite decades of research on the implementation of silicon neurons, current artificial neurons are still much larger in physical size and power than a typical human neuron. Considering the large number of neurons in a biological-scale network, this imposes both performance and power-efficiency constraints. Consequently, reducing the power and chip area of artificial neurons is important for implementing larger systems for higher-level tasks.

Given the two resistance states of the RSD, a spiking neuron is constructed by pairing the RSD with a transistor as a configurable impedance. Fig. 3b shows the structure of a spiking neuron cell with the synapse receiving the input spike. The resistance of an RSD, $R_M$, switches between insulative ($R_I$) and conductive ($R_C$) states. To simplify the analysis, the equivalent impedance of the transistor is represented as $R_L$. $R_M$ and $R_L$ are connected in series as a voltage divider, hence the voltage of the connection node $V_O$ has two stable levels

$$V_O = \frac{V_{DD} \times R_L}{R_L + R_M} = \begin{cases} V_{DD} \times \frac{R_L}{R_L + R_I} = V_I & \text{for insulative state} \\ V_{DD} \times \frac{R_L}{R_L + R_C} = V_C & \text{for conductive state} \end{cases} \quad (1)$$
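As a rough numerical illustration of the divider relation $V_O = V_{DD} R_L / (R_L + R_M)$, the sketch below evaluates the two stable levels. The supply voltage and $R_L$ are assumed values chosen only for illustration; $R_I$ and $R_C$ are borrowed from the `R2` and `R1` defaults of the Verilog-A model in Section 2.

```python
def divider_voltage(v_dd, r_l, r_m):
    """V_O = V_DD * R_L / (R_L + R_M) for the series RSD/transistor divider."""
    return v_dd * r_l / (r_l + r_m)

V_DD = 10.0            # supply (V), assumed
R_L = 25e3             # equivalent transistor impedance (ohm), assumed
R_I, R_C = 50e3, 1e3   # insulative / conductive RSD resistance (ohm), model defaults

v_i = divider_voltage(V_DD, R_L, R_I)   # stable level in the insulative state
v_c = divider_voltage(V_DD, R_L, R_C)   # stable level in the conductive state
print(f"V_I = {v_i:.3f} V, V_C = {v_c:.3f} V")
```

Because $R_I > R_C$, the insulative level $V_I$ always sits below the conductive level $V_C$.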

As shown in Fig. 3c, the neuron behavior with a pre-excited spike contains three stages:

1. the RSD charges $C_L$ to $V_C$,
2. the RSD transitions to insulator state,
3. $R_L$ discharges $C_L$ to $V_I$, and it stops here.

The synapses, upon receiving input spikes, reduce the total equivalent impedance from $V_O$ to ground, which lowers the voltage level and triggers an output spike. As shown in Fig. 3d, whenever the neuron receives sufficient input spike(s) from the synapses, $V_O$ reaches the triggering voltage and goes to the fourth step:

4. the RSD transitions to conductor state, with which the neuron spikes again.

The basic function of a neuron cell is to generate a spike when it receives excitation above a certain threshold, which in our case is determined by $V_{BIAS}$. Figs. 4a and 4c show the case in which the input spike does not trigger an output spike. Increasing $V_{BIAS}$ lowers the stable voltage $V_I$, and vice versa. When $V_I$ is closer to (farther from) the IMT condition of the RSD, the neuron needs fewer (more) input spikes to trigger the output spike. Therefore, a higher $V_{BIAS}$ means the neuron spikes at a lower threshold on the number of input spikes. Figs. 4b and 4d show the case in which the neuron with a higher $V_{BIAS}$ spikes for the same number of input spikes.
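The dependence of the spiking threshold on the bias can be sketched with a simple resistive model. Everything here is an assumption made for illustration: each active synapse is treated as one extra pull-down resistance `R_SYN` from $V_O$ to ground, the RSD is taken to fire when the voltage across it reaches `V_IMT`, and a smaller $R_L$ stands in for a higher $V_{BIAS}$. Only `R_I` and `V_IMT` reuse defaults from the Verilog-A model.

```python
V_DD, V_IMT = 10.0, 5.9   # supply (assumed) and RSD switching voltage (model default), in V
R_I = 50e3                # insulative RSD resistance (ohm), model default
R_SYN = 200e3             # pull-down added per active synapse (ohm), hypothetical

def rsd_voltage(r_l, active_synapses):
    """Voltage across the insulating RSD with k synapses pulling V_O toward ground."""
    g = 1.0 / r_l + active_synapses / R_SYN   # total conductance from V_O to ground
    r_eq = 1.0 / g
    return V_DD * R_I / (R_I + r_eq)

def spikes_to_fire(r_l, max_spikes=16):
    """Smallest number of simultaneous input spikes that triggers the IMT, or None."""
    for k in range(max_spikes + 1):
        if rsd_voltage(r_l, k) >= V_IMT:
            return k
    return None

low_bias = spikes_to_fire(60e3)    # larger R_L stands in for a lower V_BIAS
high_bias = spikes_to_fire(40e3)   # smaller R_L stands in for a higher V_BIAS
print(low_bias, high_bias)
```

With these assumed values the neuron with the smaller $R_L$ needs fewer simultaneous input spikes, in line with the trend described above.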

3.2 Neuron Network

A single neuron is a device of extremely limited computational capability; neural network models that solve complex problems demand large numbers of neurons deployed in an interconnected network. Thanks to its compact structure, the crossbar architecture is widely adopted for connections in silicon-neuron networks [28]. Fig. 5a shows the crossbar structure, in which the vertical lines are the axons (spike sources) $A_N$ and the horizontal lines are the outputs of neurons $N_M$, connected by the synapses $S_{MN}$ with respective weights $W_{MN}$. In the following simulations of neuron network behaviors, we use the input pattern shown in Fig. 6a. Fig. 6b shows the neuron spiking behavior with different thresholds. As mentioned in Section 3.1, a higher $V_{BIAS}$ induces a lower insulative-state voltage $V_I$, and thus requires fewer input spikes to excite an output spike.

The weights of the synapses can be zero (no connection) or positive/negative value (positive/negative correlations). A positive (negative) correlation means the neuron is more (less) likely to spike if the source axon spikes. Fig. 5b shows the basic synapse, in which $V_{ON}$ controls the connectivity and $V_{P/N}$ switches the synapse between pulling high and pulling low. The synapse has the following operations:

1. No correlation [0]: $V_{ON} = \text{low}$, $V_{P/N} = \text{X}$ (don't care).
2. Positive correlation [+]: $V_{ON} = \text{high}$, $V_{P/N} = \text{high}$.
3. Negative correlation [-]: $V_{ON} = \text{high}$, $V_{P/N} = \text{low}$.

Fig. 6c shows that, with the basic synapses, the output spikes can be both positively (+) and negatively (−) correlated with the input spikes.
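At an abstract level, one evaluation of the crossbar can be sketched as a signed, thresholded sum. This is a behavioral simplification rather than a circuit model: the integer weights, the threshold, and the function name are illustrative assumptions, and the analog dynamics of the RSD are ignored.

```python
def neuron_outputs(axon_spikes, weights, threshold):
    """One step of an M x N crossbar: neuron m fires if its weighted input sum reaches threshold."""
    fired = []
    for row in weights:                       # one row of W_MN per neuron N_M
        drive = sum(w * a for w, a in zip(row, axon_spikes))
        fired.append(drive >= threshold)
    return fired

weights = [
    [+1, +1, 0],    # neuron 0: positively correlated with axons 0 and 1
    [+1, -1, +1],   # neuron 1: axon 1 is negatively correlated
]
print(neuron_outputs([1, 1, 0], weights, threshold=2))
print(neuron_outputs([1, 0, 1], weights, threshold=2))
```

In the first call the spike on axon 1 helps neuron 0 fire and suppresses neuron 1; in the second call the roles are reversed.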

In rate coding mode the spikes can represent different values, while in the basic mode a spike carries only 1/0 (TRUE/FALSE) information. Fig. 5c shows the advanced hybrid synapse that can switch between short-term (basic) and long-term (rate coding) modes. The diode-connected transistor conducts in a single direction and charges the memory capacitor $C_M$ when input spikes occur, and the other transistor in parallel works as a switch between long- and short-term modes. In the long-term mode $V_{L/S}$ is low, so $C_M$ is not discharged during the falling edge of the input spike. Therefore, the hybrid synapse has two additional operations:

4. Long-term positive correlation [++]: $V_{L/S} = \text{low}$, $V_{ON} = \text{high}$, $V_{P/N} = \text{high}$.
5. Long-term negative correlation [−−]: $V_{L/S} = \text{low}$, $V_{ON} = \text{high}$, $V_{P/N} = \text{low}$.
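The five synapse operations amount to a small decode table over the control levels, sketched below. One assumption is made beyond the text: a high $V_{L/S}$ is taken to select the short-term (basic) mode, since only the low level is stated explicitly. The function name and the boolean encoding (True for high, False for low) are illustrative.

```python
def synapse_mode(v_on, v_pn, v_ls=True):
    """Decode control levels (True = high, False = low) into an operation label."""
    if not v_on:
        return "0"                         # no correlation; V_P/N and V_L/S are don't-care
    sign = "+" if v_pn else "-"
    return sign if v_ls else sign * 2      # V_L/S low selects the long-term mode

settings = [
    (False, True, True),    # operation 1
    (True, True, True),     # operation 2
    (True, False, True),    # operation 3
    (True, True, False),    # operation 4
    (True, False, False),   # operation 5
]
for v_on, v_pn, v_ls in settings:
    print(v_on, v_pn, v_ls, "->", synapse_mode(v_on, v_pn, v_ls))
```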

Fig. 6d shows the behavior of the neuron network in rate coding mode. For each additional long-term positively correlated (++) input spike, the output spike frequency increases, and vice versa.

The spiking neural network can be used in simple applications like pattern matching or event detection [29], or be built into large-scale systems like convolutional neural networks (CNNs) to support more complex applications like written digit recognition [30] and face detection [31].

4 OSCILLATORS AND COUPLING

Oscillators that weakly couple, as through a common substrate for mechanical oscillators, or via capacitive coupling among outputs in electrical oscillators, have collective synchronization properties that can be used to perform computation. To date, many of the systems designed to
perform computation via oscillator coupling have primarily been intended to perform tasks in the fields of image processing, pattern analysis and computer vision. By utilizing the locking behavior of coupled oscillators as a computational primitive analogous to a distance metric, the systems are capable of performing associative matching functions. The recent development of nano-oscillator based associative memories has further enhanced the potential for oscillator based systems for intelligent information processing (details in Section 8). In most of these works, however, each oscillator network (i.e., the specific topology of and weighting of oscillator coupling) has been constructed in a homogeneous fashion that focuses on solving a specific problem with a given network. In this paper, we examine the ways in which the computational paradigm can be extended, and the networks configured, to support a broader family of functions for a specific domain on a single computing fabric where each tile in the fabric contains a tunable oscillator network.

In the rest of this section, we build upon the introduction to Hyper-FET oscillators in Section 2 and discuss how the timing and degree of synchronization among \( N \) weakly-coupled oscillators correspond to certain many-body computations. We then introduce the particular coupling topology, capacitive coupling with a common output node, that we focus on when considering coupled oscillators.

4.1 HyperFET Oscillator

Similar to the spiking neurons, Hyper-FETs can also be used to construct nano-oscillators. Fig. 7a shows the structure of a relaxation oscillator with the RSD, in which the resistance \( R_M \) switches between insulative \((R_I)\) and conductive \((R_C)\) states. To simplify the oscillator model, the parasitic capacitance \( C_P \) in Fig. 7a is lumped to \( C_L \). As shown in Fig. 7b, the oscillation cycle contains four stages:

1. the RSD charges \( C_L \) to \( V_C \),
2. the RSD transitions to insulator state,
3. \( R_L \) discharges \( C_L \) to \( V_I \), and then
4. the RSD transitions back to conductor state.
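The four-stage cycle can be reproduced with a minimal time-domain sketch. The assumptions are: the RSD switches instantly when the voltage across it crosses its thresholds (so the switch happens slightly before $V_O$ settles at $V_C$ or $V_I$), the RSD parameters are the Verilog-A defaults from Section 2, and `V_DD`, `R_L`, and `C_L` are illustrative values chosen so that the circuit oscillates. The simulated period is compared with the sum of the two exponential settling times.

```python
import math

V_DD, C_L = 10.0, 1e-9        # supply (V) and load capacitance (F), assumed
R_L = 25e3                    # series load resistance (ohm), assumed
R_I, R_C = 50e3, 1e3          # RSD resistances (ohm), Verilog-A defaults
V_IMT, V_MIT = 5.9, 0.5       # RSD switching voltages (V), Verilog-A defaults

def simulate(t_end=400e-6, dt=1e-9):
    """Forward-Euler integration of V_O; returns the times of insulator-to-metal switches."""
    v_o, conducting, t, switch_times = 0.0, True, 0.0, []
    while t < t_end:
        r_m = R_C if conducting else R_I
        # Stages 1 and 3: the RSD sources current into C_L while R_L sinks current to ground
        v_o += dt * ((V_DD - v_o) / r_m - v_o / R_L) / C_L
        v_rsd = V_DD - v_o
        if conducting and v_rsd <= V_MIT:        # stage 2: switch to insulator
            conducting = False
        elif not conducting and v_rsd >= V_IMT:  # stage 4: switch back to conductor
            conducting = True
            switch_times.append(t)
        t += dt
    return switch_times

def analytic_period():
    """Sum of the exponential charge and discharge times between the two switching levels."""
    v_hi, v_lo = V_DD - V_MIT, V_DD - V_IMT      # V_O levels at which the RSD switches
    v_c = V_DD * R_L / (R_L + R_C)               # conductive-state target
    v_i = V_DD * R_L / (R_L + R_I)               # insulative-state target
    tau_c = C_L * R_C * R_L / (R_C + R_L)
    tau_i = C_L * R_I * R_L / (R_I + R_L)
    t_charge = tau_c * math.log((v_c - v_lo) / (v_c - v_hi))
    t_discharge = tau_i * math.log((v_hi - v_i) / (v_lo - v_i))
    return t_charge + t_discharge

times = simulate()
period_sim = (times[-1] - times[0]) / (len(times) - 1)
print(f"simulated {period_sim * 1e6:.2f} us, analytic {analytic_period() * 1e6:.2f} us")
```

Both time constants have the form $C_L R_M R_L / (R_M + R_L)$, so the discharge through the insulating RSD dominates the period for these values.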

Without the phase transition, the system would tend to stay in one of the two stable voltage levels, \( V_I \) and \( V_C \). The respective stable currents through the RSD \( (I_{SM}) \) are related to the stable voltages, with the current in the insulative state ($I_I$) lower ... For the system to oscillate, the stable region, determined by the discharging current ...

$$\text{frequency} \propto \frac{R_M + R_L}{R_M \times R_L \times C_L}.$$

Replacing the resistor with a transistor introduces a control input, producing a voltage-controlled relaxation oscillator (VCRO) ... by increasing the input voltage ...

Fig. 8. (a) The frequency of VCRO under various $V_{IN}$ conditions, and the waveforms of $V_O$ around (b) the upper and (c) the lower oscillating boundary of $V_{IN}$. (Dashed: oscillating, solid: stable.)

The current flowing from one oscillator, through the coupling capacitors, delays the other oscillators’ next rising. For instance, the rising of $V_{O1}$ induces current to the $C_{prop}$s, and causes a delay on the discharging of $V_{O2}$ and vice versa. The charge induced by the rising of $V_{O1}$ on $V_{O2}$ is discharged by the discharging current $I_{D2}$, which decreases when $V_{O2}$ approaches the stable voltage $V_I$. Consequently, the delay is longer when $V_{O2}$ is closer to the next rising. That means when $T(P1,V2) < T(P2,V1)$, the delay makes the discharging of $V_{O2}$ longer than that of $V_{O1}$, and therefore reduces the difference between $T(P1,V2)$ and $T(P2,V1)$. As a result, the system stabilizes at the state of $T(P1, V2) = T(P2, V1)$ when $V_{IN1} = V_{IN2}$.

In the case of $V_{IN1} \neq V_{IN2}$, the VCROs have different oscillating frequencies. However, the oscillations will still couple together if the input voltage difference $\Delta V$ is in a certain range. Fig. 9d shows the case of $V_{IN1} < V_{IN2}$. With the higher input voltage, the VCRO2 oscillates faster, and therefore $V2$ gets closer and closer to $P1$. However, the rising $V_{OUT1}$ induces a delay on the discharging of VCRO2, as previously mentioned, which extends the oscillation period of VCRO2. Because VCRO2 tends to oscillate faster, and the delay prevents it from speeding up, the oscillations are locked in an unbalanced phase difference but still at the
same frequency. In an unstable or pre-stable oscillation, there is an ongoing phase change. If the oscillations are going to stabilize, the amount of phase change decreases in each cycle. As shown in Fig. 9d, the phase change is about $1\ \mu\text{s}$ at the beginning and $0.1\ \mu\text{s}$ after five cycles. We note that the phase change is less than $10^{-4}\ \mu\text{s}$ after 10 cycles, and the system stays locked in phase as simulated for over 1,000 cycles. We declare the system to have stabilized within 10 cycles because $10^{-4}\ \mu\text{s}$ is close to the time granularity of the simulation. If we further increase the difference between $V_{IN1}$ and $V_{IN2}$, the oscillators have different frequencies, and one oscillator has a shorter cycle (individual oscillation period plus the coupling delay) than the other. In that case, out-of-phase oscillations can be observed within a few cycles.

4.3 N Coupled Oscillators

A key appeal of computing using coupled oscillators is that, by increasing the number of oscillators that are coupled together, the degree-of-difference of vectors can be simultaneously computed. The coherent oscillations are synchronized at the same frequency and stabilized at a constant phase difference to each other, and the coherence of the oscillators is correlated with the similarity of the input voltages. In the experiments, three-coupled oscillators have been measured as shown in Fig. 10a. Figs. 10b and 10c show the case of three-coupled oscillators in simulation. Similar to two-coupled oscillators, the three-coupled oscillators synchronize at the same frequency and have an equal phase difference of $2\pi/3$ when they have the same input voltages (Fig. 10b). A VCRO with $V_{IN}$ higher than average tends to oscillate faster than the others and vice versa. Fig. 10c shows the case of three oscillators with unequal inputs ($V_{IN1} > V_{IN2} > V_{IN3}$). As discussed in Section 4.2, the currents passing through the coupling capacitors are the key to oscillator synchronization, as the voltage rising of one oscillator induces a current that delays the discharging of the others. However, this effect

**Fig. 9.** The topology and schematic of (a) the traditional coupled relaxation oscillator pair and (b) proposed configurable synchronized oscillator network. The output waveforms of coupled oscillators for (c) $V_{IN1} = V_{IN2}$, (d) $V_{IN1} < V_{IN2}$.

**Fig. 10.** (a) The measured frequency spectrum and time domain waveforms of the coupled oscillators ($n = 3$). The output waveform ($n = 3$) for [$V_{IN1}, V_{IN2}, V_{IN3}$] = (b) [450 mV, 450 mV, 450 mV] (coherent inputs) and (c) [550 mV, 500 mV, 450 mV] (incoherent inputs). (e) The output waveform ($n = 9$) for all $V_{IN} = 450 \text{ mV}$.

... synchronization time increases as the number of oscillators increases. As we go from \( n = 3 \) to \( n = 9 \), the coupling current induced by the rising of one of the oscillators is now shared by eight instead of two. Thus, to achieve coupling strengths among nine oscillators equivalent to that for 3, larger coupling capacitors ...

4.4 Phase Information

The swing of \( V_{\text{OUT}} \) reflects the interaction of \( V_{\text{O1}} \) and \( V_{\text{O2}} \); \( V_{\text{OUT}} \) rises when either \( V_{\text{O1}} \) or \( V_{\text{O2}} \) rises, and falls when both of them fall. In one cycle of the oscillation, there are two pulses of \( V_{\text{OUT}} \), one for the charging of VCRO1 and the other for that of VCRO2. Therefore, the pulses have the same amplitude if \( V_{\text{IN1}} = V_{\text{IN2}} \). The amplitude of the \( V_{\text{OUT}} \) waveform increases if one falling time is shorter than the other, which happens when \( V_{\text{IN1}} \neq V_{\text{IN2}} \) within the bounded range of allowable \( \Delta V \) where the circuit is stable.

The output of synchronized oscillations has three properties. First, the synchronized oscillators generate a stable amplitude corresponding to the degree of match for the inputs with low deviation. Second, inputs with high deviation break synchronization, and the amplitude of \( V_{\text{OUT}} \) becomes non-uniform. Third, an oscillator within a coupled oscillator network can be intentionally shut down by providing a large or small \( V_{\text{IN}} \) outside the oscillation boundaries, as illustrated in Fig. 8.

For inputs corresponding to the first property, the output behavior is close to the mathematical formulation of deviations (e.g., standard/absolute deviation) in the region of synchronization. Fig. 12 shows the simulation results of the oscillator-based deviation compared to the corresponding mathematical model (standard deviation). The key difference between the two deviation approaches is that the output of the simulation results is less sensitive to a higher input, which is due to the non-linearity of \( g_{\text{m}} \) of the transistor. To deal with non-synchronizing inputs, we employ thresholding to detect peak amplitudes beyond the acceptable range. In addition, forcing the shutdown of certain oscillators allows an N-oscillator array to emulate K-input functions for \( K < N \).
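For reference, the mathematical model that the oscillator output is compared against can be written in a few lines. This sketch computes only the ideal standard deviation, not the oscillator response; treating it as a population standard deviation, and the `active` mask used to mimic shutting oscillators down, are illustrative assumptions. The input voltages are the ones quoted in the Fig. 10 caption.

```python
import statistics

def deviation_model(v_in_mv, active=None):
    """Population standard deviation (mV) of the inputs of the oscillators left running."""
    if active is not None:                 # emulate K < N inputs by shutting oscillators down
        v_in_mv = [v for v, on in zip(v_in_mv, active) if on]
    return statistics.pstdev(v_in_mv)

coherent = deviation_model([450, 450, 450])
incoherent = deviation_model([550, 500, 450])
two_input = deviation_model([550, 500, 450], active=[True, False, True])
print(coherent, round(incoherent, 2), two_input)
```

Masking the middle oscillator turns the three-input array into a two-input comparison of 550 mV and 450 mV.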
4.5 Amplitude Readout Circuit

Fig. 13a shows the read-out circuitry. The input to the read-out circuit is $V_{OUT}$ from the synchronized oscillators, which is DC-biased by the biasing network $R_{BIAS}$. The source follower operating with $I_{BIAS}$ works as a buffer for $V_{OUT}$. Finally, the diode-connected transistor rectifies and follows $V_{SF}$ to generate the readout voltage $V_{ANALOG}$ at the load capacitance $C_L$. Fig. 13b shows the conversion from $V_{OUT}$ amplitude to the analog voltage output $V_{ANALOG}$ that the readout circuit performs.
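The rectify-and-hold behavior of the readout can be approximated by an idealized peak detector. This is a behavioral sketch only: the sampled waveform, the fixed drop `v_drop` standing in for the diode-connected transistor, and the per-sample `leak` fraction are all hypothetical, and the biasing network and source follower are not modeled.

```python
def peak_readout(v_sf, v_drop=0.0, leak=0.0):
    """Track the peak of the buffered waveform, less a fixed drop, with optional droop."""
    v_analog, trace = 0.0, []
    for sample in v_sf:
        held = v_analog * (1.0 - leak)          # charge lost from C_L since the last sample
        v_analog = max(held, sample - v_drop)   # the rectifier only ever charges C_L
        trace.append(v_analog)
    return trace

readout = peak_readout([0.0, 0.2, 0.5, 0.3, 0.1])
print(readout)
```

With no leakage the output holds the largest sample seen so far, which is how a larger $V_{OUT}$ amplitude maps to a larger $V_{ANALOG}$ in this simplified picture.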

5 CONFIGURABLE OSCILLATOR COMPUTATIONS

As presented in the previous section, HyperFET oscillators provide a powerful, but limited-flexibility, primitive computational operation. In this section, we extend the computational capabilities of each oscillator by adding additional control inputs to configure its behavior. With these additional inputs, we can now efficiently realize a family of related primitives with a given oscillator network, rather than a single functionality.

The coherence of the synchronized oscillators is defined as the similarity of their oscillation frequencies. As mentioned previously in Section 4.1, the oscillation frequency of a VCRO is determined by $V_{IN}$, which is linearly correlated to the discharging current $I_D$,

$$I_D = I_R + I_T = I_R + (g_m \times V_{IN} + I'_T). \quad (6)$$

Generally speaking, the relation between $I_D$ and $V_{IN}$ can be configured by changing the resistance and the transistor size. Motivated by the fact that similar $I_D$ values induce similar oscillation frequencies, we explore the configurable mapping between $V_{IN}$ and $I_D$ in this section.

5.1 Configurability

5.1.1 Base Case

Starting from $n = 2$, a coupled-oscillator structure is shown in Fig. 14a. The oscillation strength of a VCRO, defined as its individual oscillation frequency when it is not coupled, is positively correlated with the discharging current $I_D$. Two transistors sized $W_1$ and $W_2$ act as voltage-controlled current sources, and two resistors ($R_{L1}$ and $R_{L2}$) provide biasing current to the system. In the base case, the system is biased in balance, and the transistors have equal size. As shown in the right half of Fig. 14b, the lowest $V_{OUT}$ amplitude lies on the diagonal, which corresponds to $V_{IN1} = V_{IN2}$. The left half of Fig. 14b shows the cutting plane at $V_{IN1} = 450$ mV. For unbalanced inputs ($V_{IN1} - V_{IN2} = \Delta V$), the amplitude increases with $\Delta V$. The peak amplitude of $V_{OUT}$ stops increasing when the VCROs are no longer coupled, and the maximum peak amplitude is around the amplitude of $V_{OUT1}$ or $V_{OUT2}$.

In this work, a “similarity line” is defined as the collection of conditions ($V_{IN1}, V_{IN2}$) for which the lowest $V_{OUT}$ amplitude occurs. Essentially, the similarity line represents the condition that the oscillations are exactly coherent. The similarity line for the base case can be described as $L_S : \{V_{IN1} = V_{IN2}\}$. Accordingly, the circuit in Fig. 14a outputs a signal $V_{OUT}$, of which the amplitude relates to the distance between the point ($V_{IN1}, V_{IN2}$) and the corresponding similarity line.

5.1.2 Shifting Case

The biasing resistors $R_{L1}$ and $R_{L2}$ can be designed to be unequal to each other, and the unbalanced biasing results in a shift on $L_S$. Fig. 14c shows the case for an increased $R_{L2}$. When the biasing current decreases with the enlarged resistance, the total discharging current on VCRO2 drops under the condition $V_{IN1} = V_{IN2}$. Therefore, VCRO2 oscillates slower, and the amplitude of $V_{OUT}$ increases for the unbalanced oscillation. To cover the biasing current reduction, VCRO2 needs a higher input voltage; i.e., a voltage shift on $V_{IN2}$ results in the same driving strength compared to VCRO1. Effectively, $V_{IN2}$ is subtracted by a value $V_S$. As a result, the similarity line is shifted as $L_S : \{V_{IN1} = V_{IN2} - V_S\}$, where $V_S$ is the voltage shift.

5.1.3 Narrowing Case

The transistor size of the VCRO sets the ratio of current change to input change, and therefore affects the sensitivity of the oscillation frequency. Fig. 14d shows the case for an increased $W_2$. According to Eq. (6), the change in current for a given input change is proportional to the transistor size. Therefore, a given voltage variation induces more deviation at the output with a larger transistor, which narrows the width of the valley. As a result, different sensitivity factors ($a_1, a_2$) can be assigned to the VCROs by giving them different transistor sizes.

In summary, given the configurable behavior of the coupled VCROs, the similarity line could be virtually a linear combination of $V_{IN1}$ and $V_{IN2}$, i.e.,

$$L_S : \{a_1 \times (V_{IN1} - V_{S1}) = a_2 \times (V_{IN2} - V_{S2})\}. \quad (7)$$

Generally speaking, the minimum amplitude at the node $V_{OUT}$ occurs when the controlling currents $I_D$ of the two oscillators are equal, giving equal, or near equal frequencies, and enabling the oscillators to couple to a common frequency with equally distributed phases.
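As a numerical illustration of Eq. (7), the Python sketch below evaluates a mismatch value that is zero on the similarity line and grows as the inputs move away from it. The function name and the example shift and weight values are hypothetical, and the mismatch is only a proxy for the difference in discharging currents; it is not the simulated $V_{OUT}$ amplitude.

```python
def similarity_mismatch(v_in1, v_in2, a1=1.0, a2=1.0, v_s1=0.0, v_s2=0.0):
    """|a1*(V_IN1 - V_S1) - a2*(V_IN2 - V_S2)|: zero exactly on the
    similarity line L_S of Eq. (7), larger further away from it."""
    return abs(a1 * (v_in1 - v_s1) - a2 * (v_in2 - v_s2))


# Base case: equal inputs lie on L_S: V_IN1 = V_IN2.
base = similarity_mismatch(0.45, 0.45)

# Shifting case: a hypothetical 20 mV shift on VCRO2 moves the line to
# V_IN1 = V_IN2 - V_S, so the pair (0.45 V, 0.47 V) is now on the line.
shifted = similarity_mismatch(0.45, 0.47, v_s2=0.02)

# Narrowing case: with both shifts at 0.45 V the line passes through
# (0.45 V, 0.45 V). Doubling a2 doubles the mismatch caused by the same
# 50 mV offset on V_IN2, which corresponds to a narrower valley.
wide = similarity_mismatch(0.45, 0.50, v_s1=0.45, v_s2=0.45)
narrow = similarity_mismatch(0.45, 0.50, a2=2.0, v_s1=0.45, v_s2=0.45)
```

In this sketch the sensitivity factors play the role of relative transistor sizes, so only their ratio matters for where the line lies.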

5.2 Configurable VCRO Module

Based on the features described in Section 5.1, we can build configurable VCRO modules in systems of n-oscillators. Fig. 15a shows the structure of the proposed configurable oscillator. The discharging current $I_D$ is provided by three components: two transistors sized $W$ and one resistor ($R_L$),

$$I_D = I_R + I_T = I_R + g_m \times (V_x + V_y) + I'_T. \quad (8)$$

The transistor sizes of each of the synchronized oscillators could be different to give various $g_m$ ratios to the system.

Essentially, those two transistors replace the transistor in the non-configurable VCRO (Fig. 14a), splitting the input $V_{IN}$ into $V_x$ and $V_y$. Fig. 15b shows the simulation results for one configurable VCRO with swept inputs $V_x$ and $V_y$, coupled to another VCRO module with fixed inputs ($V'_x = V'_y = 450$ mV). In the simulation, the minimum $V_{OUT}$ amplitude occurs when $(V_x + V_y)/2 = 450$ mV, which means $V_x$ must be negatively correlated with $V_y$ to keep the same $I_D$. Therefore, $V_y$ can serve as a configurable parameter that changes the correspondence between $V_x$ and the $V_{OUT}$ amplitude.

As shown in Fig. 15c, the transistor size can be flexible by splitting the transistor into multiple switch-controlled transistors. Similarly, the resistor can be replaced by another transistor to become reconfigurable.
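The negative correlation between $V_x$ and $V_y$ can be checked with a small Python sketch of the linearised current model in Eq. (8). The device values below are hypothetical placeholders chosen only to exercise the formula, and both modules are assumed to share the same transconductance, resistor current, and transistor offset current.

```python
def discharge_current(v_x, v_y, g_m, i_r, i_t_offset):
    """Linearised discharging current of Eq. (8):
    I_D = I_R + g_m * (V_x + V_y) + transistor offset current."""
    return i_r + g_m * (v_x + v_y) + i_t_offset


def matching_v_x(v_y, v_x_ref=0.45, v_y_ref=0.45):
    """V_x that reproduces the reference module's I_D when both modules
    share g_m, I_R and the offset current: V_x + V_y = V'_x + V'_y."""
    return v_x_ref + v_y_ref - v_y


# Hypothetical device values (siemens, amperes, amperes).
G_M, I_R, I_T0 = 20e-6, 1e-6, 0.5e-6

i_ref = discharge_current(0.45, 0.45, G_M, I_R, I_T0)
v_x = matching_v_x(v_y=0.50)       # raising V_y lowers the matching V_x
i_cfg = discharge_current(v_x, 0.50, G_M, I_R, I_T0)
```

Raising $V_y$ by 50 mV lowers the matching $V_x$ by the same amount, which is the sense in which $V_y$ shifts the response to $V_x$.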

5.3 Mathematical Expression

There are several useful mathematical vector operations that can be mapped to the behavior of synchronized VCROs. For a configurable VCRO, the voltages $V_x$ and $V_y$ are negatively correlated for the same oscillation strength in terms of $I_D$. To implement the comparison between the input $x$ and the target $y$, the numerical $y$ is inversely mapped to the voltage $V_y$, as shown in Fig. 15d.

To visualize the relation between the configuration change and function change, we define the behavior of the synchronized oscillators as the analog domain and the parameters of configurations as the numerical domain. Specifically, we map the numerical range of the inputs, e.g., 1 to 256, to the active region of the oscillators in terms of voltage range, e.g., 0.35 V to 0.55 V. In Fig. 16, the analog domain and the numerical domain are marked as the blue and red blocks, respectively. In the analog domain, there is an active region with the diagonal similarity line. For the base case, the numerical domain maps perfectly onto the analog domain. In the narrowing case, the input number is multiplied by a factor $a_2$. As a result, the extended space in the numerical domain is larger than the active region of the analog domain. Thus, seen from the numerical domain, the similarity region is equivalently narrowed by $a_2$. The mapping relations above explain the shifting and narrowing in Fig. 14.
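The mapping between the two domains can be sketched in Python as follows. The linear form of the map is an assumption made for illustration (the text gives only the example ranges), and the function name is hypothetical. The target $y$ uses the inverse map so that $x = y$ gives $(V_x + V_y)/2 = 450$ mV, the condition observed for minimum amplitude in Section 5.2.

```python
V_LOW, V_HIGH = 0.35, 0.55   # example active region from the text (volts)
N_MIN, N_MAX = 1, 256        # example numerical range from the text


def to_voltage(value, inverse=False):
    """Map a number in [N_MIN, N_MAX] linearly onto [V_LOW, V_HIGH].
    With inverse=True the map is reversed, as used for the target y."""
    frac = (value - N_MIN) / (N_MAX - N_MIN)
    if inverse:
        frac = 1.0 - frac
    return V_LOW + frac * (V_HIGH - V_LOW)


v_x = to_voltage(100)                  # input x, direct mapping
v_y = to_voltage(100, inverse=True)    # target y, inverse mapping
midpoint = (v_x + v_y) / 2             # equals (V_LOW + V_HIGH) / 2 when x == y
```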

For the $n$-synchronized oscillators, there is a similarity line in $n$-Dimensional space, indicating the condition of exact coherence for the $n$-oscillator system. After the mapping, the similarity line becomes $L_S : \{a_1(x_1 - y_1) = a_2(x_2 - y_2) = \cdots = a_n(x_n - y_n)\}$ in the $n$-Dimensional space, and $a_i$ is proportional to the transistor width $W_i$.

The amplitude of $V_{OUT}$ reflects the deviation of the oscillation frequencies, which are positively correlated with the discharging currents $I_D$. If the vector inputs $X$ and $Y$ are on the similarity line, the amplitude of $V_{OUT}$ is at its minimum. Otherwise, $V_{OUT}$ returns the deviation $D(\alpha(X - Y))$ (within the similarity region).

Based on the properties of the proposed configurable VCROs, the functions of an $n$-oscillator system include, but are not limited to:

(I) $D(X) = \left( \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|^k \right)^{\frac{1}{k}}$, where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$. Measuring the order-$k$ deviation of the elements of an input vector $X$, or checking whether the deviation is above some threshold. The coefficient $k$ is determined by the coupling and load capacitance; $k = 2$ gives the standard deviation (the form computed by the CMOS baseline in Section 7.1) and $k = 1$ gives the absolute deviation.

(II) $D(\alpha X)$. Finding the matching degree for the $n$ elements of $X$ to a given ratio $\left(\frac{1}{w_1}, \frac{1}{w_2}, \ldots, \frac{1}{w_n}\right)$. It can also be used as the distance of the input $X$ to the given line when the function is extended to $D(\alpha(X - Y))$. Given a threshold, the oscillators would detect the points in the cylinder region around the line.

(III) $D((X - Y) \cup \{0\})$. Finding the similarity of two points $X$ and $Y$. Given a threshold, the oscillators would detect the points in the ball region around $Y$. The shape of the ball region can be changed when the function is extended to $D(\alpha(X - Y) \cup \{0\})$.

(IV) $D((X - Y)')$, where $(X - Y)' = \{x_i - y_i \mid (x_i - y_i) \in \text{oscillation range}\}$. Measuring the deviation of only the input elements that are in the oscillation range.

where the input vector is $X = (x_1, x_2, \ldots, x_n)$, and the configurable parameters are:

$$
\begin{aligned}
\text{narrowing} &: \alpha = (a_1, a_2, \ldots, a_n), \\
\text{shifting} &: Y = (y_1, y_2, \ldots, y_n).
\end{aligned}
$$

Fig. 17 shows the diagrams of the functions above.
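The four functions can be written out numerically as in the Python sketch below. It takes $D$ to be the order-$k$ deviation about the mean, the same form as the $k = 2$ expression used for the CMOS baseline in Section 7.1; that reading, the function names, and the explicit range bounds in (IV) are assumptions for illustration, and the code models only the mathematics, not the oscillator dynamics.

```python
def deviation(values, k=2):
    """(I) Order-k deviation about the mean; k=2 is the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    return (sum(abs(v - mean) ** k for v in values) / n) ** (1.0 / k)


def d_scaled(x, alpha, y=None, k=2):
    """(II) D(alpha*(X - Y)); with y=None this is D(alpha*X)."""
    y = [0.0] * len(x) if y is None else y
    return deviation([a * (xi - yi) for a, xi, yi in zip(alpha, x, y)], k)


def d_point(x, y, alpha=None, k=2):
    """(III) D(alpha*(X - Y) U {0}): the extra zero element turns the
    similarity line into a single point at X = Y."""
    alpha = [1.0] * len(x) if alpha is None else alpha
    diffs = [a * (xi - yi) for a, xi, yi in zip(alpha, x, y)]
    return deviation(diffs + [0.0], k)


def d_in_range(x, y, low, high, k=2):
    """(IV) Deviation over only those differences inside [low, high]."""
    kept = [xi - yi for xi, yi in zip(x, y) if low <= xi - yi <= high]
    return deviation(kept, k) if kept else 0.0


# X proportional to (1/a_1, ..., 1/a_n) lies on the line of (II).
on_ratio = d_scaled([2.0, 4.0, 6.0], alpha=[1 / 2, 1 / 4, 1 / 6])
# A common offset between X and Y is invisible to D(X - Y) but not to (III).
offset_line = d_scaled([3.0, 3.0], alpha=[1.0, 1.0], y=[0.0, 0.0])
offset_point = d_point([3.0, 3.0], [0.0, 0.0])
```

The last two values illustrate why a zero input is appended in (III): without it, any $X$ that differs from $Y$ by the same amount in every element would still sit on the similarity line.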

6 CONFIGURABLE OSCILLATOR APPLICATIONS

To parallelize the repetitive computation of, for example, image filtering, a system can be built with an array of processing units. Fig. 18a shows the system diagram of parallel computation with a 10-by-10 array of oscillator-based processing units, each with nine coupled oscillators. The image, with a large number of analog pixels, is captured by the camera sensor, segmented into windows sized to match the array of processing units (10-by-10), and processed by the units in parallel. The control signals, $V_a$ and $Y$, can be either identical or different for each processing unit, depending on the application. In most cases, the control signals come from the higher-level architecture and are fixed for the repetitive computations.

Each of the processing units is an independent 9-oscillator module that can perform the processing in parallel with other modules. As shown in Fig. 18b, the 9-oscillator module is composed of the 9-synchronized oscillators, the readout, and the thresholding circuits (voltage comparator).
Taking image processing as an example, nine input voltages ($V_{x1}, V_{x2}, \ldots, V_{x9}$), corresponding to one center pixel and the eight neighboring pixels, are connected to the oscillators. The control signals, in terms of $V_y$ and $V_x$, configure the oscillator for the particular functions.

### 6.1 Distance and Deviation

The 9-synchronized oscillators naturally perform a deviation measurement by reflecting the distance between the input point $X$ and the similarity line in n-Dimensional space. By fixing one of the inputs to, for example, zero, the similarity line becomes a point in the (n-1)-Dimension, as illustrated in Fig. 17-III.

### 6.2 Image Filtering

To process the pixel values of an image, as in the example of Fig. 18a, the oscillators are controlled by different configuration values to perform various functions. Table 1 shows the configurations of the oscillators for various functions, and Fig. 19 shows the original and processed images.

#### 6.2.1 Salience and Edge Detection [35]

A salient point is more likely to be an edge if it is located in a region with more deviation. Obtained by measuring the deviation of nine neighboring pixels, a pixelwise salience map can be used as edge information.
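A software reference for this salience measure might look like the Python sketch below, which computes the standard deviation of each $3 \times 3$ neighborhood. It is a numerical stand-in for what the 9-oscillator module measures, not a model of the circuit; the function names, the choice $k = 2$, the zero-valued border, and the test image are assumptions.

```python
def deviation(values, k=2):
    """Order-k deviation about the mean (k=2: standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    return (sum(abs(v - mean) ** k for v in values) / n) ** (1.0 / k)


def salience_map(image):
    """Deviation of the 3x3 neighborhood around every interior pixel.
    Border pixels, which lack a full neighborhood, are left at zero."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [image[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            out[r][c] = deviation(window)
    return out


# Hypothetical 5x5 image with a vertical step between columns 2 and 3.
image = [[0, 0, 0, 10, 10] for _ in range(5)]
salience = salience_map(image)
```

Pixels whose neighborhood straddles the step receive a high value, while pixels in the flat region stay at zero; thresholding the map then gives the edge information.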

#### 6.2.2 Directional Edge Detection [36]

Another edge detection approach is to detect edges in certain directions. To detect a line in, for example, the vertical (column) direction, we can compute the deviation of three pixels in the row direction. A higher deviation in the horizontal direction indicates a higher likelihood of an edge in the vertical direction.
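The directional variant can be sketched in Python by restricting the deviation to three pixels in a row. The function names and the two small test images are hypothetical; the sketch only shows that a vertical edge produces a response while a horizontal edge does not.

```python
def deviation(values, k=2):
    """Order-k deviation about the mean (k=2: standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    return (sum(abs(v - mean) ** k for v in values) / n) ** (1.0 / k)


def row_deviation(image, r, c):
    """Deviation of the left, center and right pixels at (r, c).
    A large value suggests a vertical edge passing through the triple."""
    return deviation([image[r][c - 1], image[r][c], image[r][c + 1]])


vertical_edge = [[0, 0, 10, 10] for _ in range(3)]
horizontal_edge = [[0, 0, 0, 0], [0, 0, 0, 0], [10, 10, 10, 10]]

response_vertical = row_deviation(vertical_edge, 1, 1)
response_horizontal = row_deviation(horizontal_edge, 1, 1)
```

Using three pixels in a column instead would give the complementary detector for horizontal edges.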

#### 6.2.3 Dilation and Erosion [37]

By intentionally shifting the input to the oscillation boundary of the VCRO, the oscillation occurs only when the input voltage is higher (or lower) than the threshold. Synchronization is not needed in this mode; detecting any high bit (or low bit) in a set of five pixels (central and four neighbors) yields the dilation (or erosion) filter.

---

**Table 1**

<table>
<thead>
<tr>
<th>Function</th>
<th>X</th>
<th>Y</th>
<th>$\alpha$</th>
<th>Output</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Salience Detection</td>
<td>Pixel values ($x_1 \to x_9$)</td>
<td>No shift (0's)</td>
<td>Equal weight (1's)</td>
<td>$D(X)$</td>
<td>Pixelwise salience for edge detection</td>
</tr>
<tr>
<td>Dilation / Erosion</td>
<td>Pixel values</td>
<td>$V_{TH} + V_{LB}$ / $V_{TH} - V_{LB}$</td>
<td>Equal weight (1's)</td>
<td>Oscillate ? 1 : 0</td>
<td>$V_{TH}$: Threshold, $V_{LB}$: Lower boundary, $V_{UB}$: Upper boundary</td>
</tr>
<tr>
<td rowspan="2">Color Detection</td>
<td>RGB values</td>
<td>No shift</td>
<td>$1/(\text{Target RGB})$ ($\alpha_1 \to \alpha_3$)</td>
<td>$D(\alpha X)$</td>
<td>Detect color in given RGB ratio</td>
</tr>
<tr>
<td>RGB $\cup \{0\}$</td>
<td>Target RGB $\cup \{0\}$ ($y_1 \to y_4$)</td>
<td>Equal weight</td>
<td>$D(X - Y)$</td>
<td>Detect color in given RGB value</td>
</tr>
<tr>
<td>Pattern matching</td>
<td>Input sequence $\cup \{0\}$</td>
<td>Target pattern $\cup \{0\}$</td>
<td>Importance ($\alpha_1 \to \alpha_9$)</td>
<td>$D(\alpha(X - Y))$</td>
<td>Degree of mismatch for 8-bit pattern</td>
</tr>
</tbody>
</table>

---

Fig. 18. (a) Image preprocessing with paralleled oscillator modules. (b) The 9-oscillator module composed of the nine synchronized configurable VCROs and peripheral output circuits.

Fig. 19. The input and output images of oscillator-based processing functions.
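The dilation and erosion rule of Section 6.2.3 reduces to a simple test on five pixels, sketched below in Python. The voltages, the threshold value, and the function names are hypothetical, and the comparison against a threshold stands in for the "oscillates or not" decision of the VCRO.

```python
def cross(image, r, c):
    """Center pixel and its four direct neighbors."""
    return [image[r][c], image[r - 1][c], image[r + 1][c],
            image[r][c - 1], image[r][c + 1]]


def dilate_pixel(image, r, c, v_th):
    """Output 1 if any of the five pixels is above the threshold."""
    return 1 if any(v > v_th for v in cross(image, r, c)) else 0


def erode_pixel(image, r, c, v_th):
    """Output 0 if any of the five pixels is below the threshold."""
    return 0 if any(v < v_th for v in cross(image, r, c)) else 1


LOW, HIGH, V_TH = 0.30, 0.60, 0.45   # hypothetical pixel voltages (volts)

one_high = [[LOW, HIGH, LOW], [LOW, LOW, LOW], [LOW, LOW, LOW]]
all_high = [[HIGH] * 3 for _ in range(3)]

dilated = dilate_pixel(one_high, 1, 1, V_TH)   # a single high neighbor suffices
eroded = erode_pixel(one_high, 1, 1, V_TH)     # any low pixel clears the output
kept = erode_pixel(all_high, 1, 1, V_TH)       # survives only if all are high
```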

#### 6.2.4 Color Detection [38]

The color information of a pixel can be described by three values in the R-G-B domain. Colors with the same R-G-B ratio appear as the same color at different brightness. Therefore, the detection of a certain color can be done by 3-synchronized oscillators. With 4-synchronized oscillators (R-G-B and zero), the color range can be fixed around a specific pixel value, which serves a different application.
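Both color-detection modes can be illustrated with the Python sketch below. The first scales each channel by the reciprocal of the target, as in the $\alpha = 1/(\text{Target RGB})$ configuration of Table 1, so any pixel with the target ratio gives zero deviation; the second appends a zero element so that only pixels near the target value match. The function names and pixel values are hypothetical, and the target channels are assumed to be non-zero.

```python
def deviation(values, k=2):
    """Order-k deviation about the mean (k=2: standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    return (sum(abs(v - mean) ** k for v in values) / n) ** (1.0 / k)


def ratio_mismatch(rgb, target_rgb):
    """D(alpha*X) with alpha = 1/target: zero for any brightness of the
    target color. Assumes every target channel is non-zero."""
    return deviation([c / t for c, t in zip(rgb, target_rgb)])


def value_mismatch(rgb, target_rgb):
    """D((X - Y) U {0}): zero only when the pixel equals the target."""
    return deviation([c - t for c, t in zip(rgb, target_rgb)] + [0.0])


target = (200, 100, 50)
dimmer = (100, 50, 25)        # same ratio, half the brightness
other = (100, 100, 25)        # different ratio

same_ratio = ratio_mismatch(dimmer, target)
diff_ratio = ratio_mismatch(other, target)
same_value = value_mismatch(target, target)
dim_value = value_mismatch(dimmer, target)
```

The dimmer pixel matches in the ratio mode but not in the value mode, which is the distinction between the two Color Detection rows of Table 1.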

### 6.3 Weighted Pattern Matching

The $X/Y$ function of the configurable oscillators essentially performs the matching between patterns. Independent weights can be achieved by assigning a different $\alpha$ to each oscillator, as shown in Fig. 20. The target pattern $Y$ is configured before sliding the window. Whenever the oscillator-based deviation module finds a matching sequence, the thresholding circuit outputs a bit-0, indicating a low difference between $X$ and $Y$.
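A sliding-window version of this matching can be sketched in Python as follows. It evaluates $D(\alpha(X - Y) \cup \{0\})$ for each window position and outputs bit-0 when the deviation is below a threshold. The function names, the threshold value, and the bit sequences are hypothetical, and $D$ is taken as the standard deviation about the mean.

```python
def deviation(values, k=2):
    """Order-k deviation about the mean (k=2: standard deviation)."""
    n = len(values)
    mean = sum(values) / n
    return (sum(abs(v - mean) ** k for v in values) / n) ** (1.0 / k)


def match_bits(sequence, target, alpha, threshold):
    """Slide the target over the sequence; emit 0 where the weighted
    difference is below the threshold (match) and 1 elsewhere."""
    out = []
    n = len(target)
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        diffs = [a * (x - y) for a, x, y in zip(alpha, window, target)]
        out.append(0 if deviation(diffs + [0.0]) < threshold else 1)
    return out


sequence = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
target = [1, 1, 0, 1, 0, 0, 1, 1]          # 8-bit pattern at offset 1
equal_weights = [1.0] * 8

result = match_bits(sequence, target, equal_weights, threshold=0.1)

# Setting a weight to zero makes the corresponding bit a "don't care".
ignore_last = [1.0] * 7 + [0.0]
relaxed = match_bits([1, 1, 0, 1, 0, 0, 1, 0], target, ignore_last, 0.1)
```

With eight pattern bits plus the zero element, each window uses nine values, matching the nine oscillators of one module.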

### 7 PERFORMANCE EVALUATION

In this section, the performance evaluation is based on the properties and experimental results with VO$_2$ as the RSD, scaled to a reasonable comparative size, and projected to the feasible operation frequency. First, we compare the performance of a single 9-oscillator module with a customized CMOS application-specific integrated circuit (ASIC) pipelined accelerator designed to perform the same function. Then we compare the array of 100 parallel 9-oscillator modules with the CMOS-based data path from a system perspective.

#### 7.1 Modularized Deviation and Scalability

Fig. 21 shows the oscillator-based deviation circuit module and the corresponding CMOS ASIC in 32nm technology. The proposed oscillator-based module performs the deviation calculation with configurable parameters, e.g., $D(\alpha(X-Y))$. To perform the comparable function, the CMOS ASIC is designed to calculate $D(Z) = \sqrt{\frac{1}{9} \sum_{j=1}^{9} (z_j - \frac{1}{9} \sum_{i=1}^{9} z_i)^2}$, where $Z = \alpha(X-Y)$. The inputs X and Y for CMOS ASIC are both 9-element arrays with 8-bit elements, and the input $\alpha$ is a 9-element array with 2-bit elements. The CMOS ASIC baseline deviation module is a 42-stage pipelined accelerator, synthesized using Synopsys Design Compiler [39]. Based on the synthesis, the CMOS ASIC module can operate up to 500 MHz, and consumes $1,100 \mu W$.

The channel size of VO$_2$ determines the voltage and resistance, and thus changes the power consumption of the oscillator-based module. Fig. 22a shows the power estimation with the scaling of VO$_2$ channel dimensions. In order to compare the oscillator performance with 32nm CMOS technology, we project the power consumption for scaled oscillators with dimensions ([W, L] = [60 nm, 36 nm]). We assume that the critical stimulus (electric field here) required for triggering the IMT in VO$_2$ remains constant [23]. We also assume that the resistance remains constant for the same aspect ratio. Accordingly, the power consumption is $4.84 \mu W$ per module after scaling (synchronized oscillators: $4.59 \mu W$, read-out: $0.19 \mu W$, thresholding: $0.06 \mu W$).

Although the oscillation speed is inversely proportional to the load capacitance, the phase transition time limits the increase of the operating frequency of the 9-oscillator modules. We found in simulation that if the oscillation period is too short, the phase transitions of the oscillators may overlap and make the synchronization less predictable. Consequently, a shorter phase transition time is needed for the module to operate faster. Fig. 22b shows the projection of operation speed. We assume that the intrinsic phase transition time $T_0$ is small compared to the RC charging time $T$ determined by the load capacitance. The optically induced phase transition in VO$_2$ has been reported to occur in as little as $\sim 75$ fs [40], and the electrically induced transition has been measured experimentally at $\sim 200$ ps [41] in a similar MIT material, $V_{O3}$. Therefore, the proposed oscillator-based module can operate at up to 9.28 M Op/s.

In summary, a single oscillator-based deviation module operates at about $1/54$ of the speed ($\frac{9.28 \text{ MHz}}{500 \text{ MHz}}$) but consumes about $1/227$ of the power ($\frac{4.84\ \mu\text{W}}{1{,}100\ \mu\text{W}}$) of the CMOS-based module.
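The two ratios follow directly from the figures quoted in this section, as the short Python check below shows. The energy-per-operation values at the end are a derived illustration that assumes the pipelined CMOS module completes one operation per clock cycle at 500 MHz; they are not reported measurements.

```python
# Figures quoted in Section 7.1.
cmos_rate_ops = 500e6        # CMOS ASIC, operations per second at 500 MHz
cmos_power_w = 1100e-6       # CMOS ASIC power
osc_rate_ops = 9.28e6        # oscillator module, operations per second
osc_power_w = 4.59e-6 + 0.19e-6 + 0.06e-6   # oscillators + read-out + thresholding

speed_ratio = cmos_rate_ops / osc_rate_ops   # about 54: the module is slower
power_ratio = cmos_power_w / osc_power_w     # about 227: the module uses less power

# Derived energy per operation (joules), under the one-op-per-cycle assumption.
cmos_energy_j = cmos_power_w / cmos_rate_ops
osc_energy_j = osc_power_w / osc_rate_ops
energy_ratio = cmos_energy_j / osc_energy_j
```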

#### 7.2 System-Level Comparison

From the system perspective, the $10 \times 10$ parallel array of 9-synchronized oscillator modules (Fig. 18a) is used to achieve a higher processing throughput, saving the overhead of data conversion and transmission. Fig. 23 shows the sensor chip data paths, with the array of 100 proposed oscillator-based modules and with the conventional CMOS ASIC accelerator. Thanks to the low gate count of the proposed 9-oscillator module (about 80 transistors, approximately 20 gates per module), it is more likely to be embedded in the same chip as the sensor units. In the conventional data path constructed from the ADC and off-chip ASIC, the pixel values are converted and transmitted before being processed.

Using multiple 9-oscillator modules in parallel, the proposed oscillator-based accelerator can achieve a higher throughput and consume less power than the CMOS-based ASIC accelerator. Table 2 compares the specifications of the proposed oscillator-based and the conventional CMOS accelerators. Operating at 500 Mpixel/s, the power of a state-of-the-art sensor, based on [42] and [43], is 1.6 mW for the sensor and 6 mW for the ADC. For most general-purpose image sensor applications, the ADC resolution is 8 bits or higher, so we use 8 bits for the applications in this work. In Fig. 23a, the proposed oscillator-based accelerator processes the pixel values without conversion, consuming 484 $\mu$W. In Fig. 23b, the CMOS-based data path with ADC and ASIC consumes 7.1 mW ($6 + 1.1$ mW), which is over 14X the power of the former. Meanwhile, the processed results, instead of the pixel values, are transmitted off the sensor chip. Compared to the CMOS data path using an 8-bit ADC, the oscillator-based processing uses 1/8 of the data transmission bandwidth (1 bit/pixel).
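The system-level figures can be reproduced from the per-module numbers with the Python sketch below. It assumes that one operation produces one output pixel and that "4K × 2K" means 4096 × 2048 pixels; both are assumptions made for this check, and the variable names are illustrative.

```python
MODULES = 100
MODULE_POWER_W = 4.84e-6     # per 9-oscillator module (Section 7.1)
MODULE_RATE_OPS = 9.28e6     # per module (Section 7.1)
ADC_POWER_W = 6e-3
ASIC_POWER_W = 1.1e-3

array_power_w = MODULES * MODULE_POWER_W          # 484 uW for the array
array_rate_ops = MODULES * MODULE_RATE_OPS        # 928 M Op/s for the array
cmos_path_power_w = ADC_POWER_W + ASIC_POWER_W    # 7.1 mW for ADC plus ASIC
power_ratio = cmos_path_power_w / array_power_w   # a little over 14


def max_frame_rate(width, height, rate_ops=array_rate_ops):
    """Frames per second, assuming one operation per output pixel."""
    return rate_ops / (width * height)


fps_512 = max_frame_rate(512, 512)       # roughly 3.5K frames per second
fps_4k2k = max_frame_rate(4096, 2048)    # roughly 110 frames per second

# One output bit per pixel versus eight bits per pixel from the ADC.
bits_per_frame_512 = 512 * 512 * 1
bandwidth_fraction = 1 / 8
```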

### 8 RELATED WORK

Recently, the concept of using coupled oscillator systems to perform non-standard computation has gained significant attention. Several previous works have demonstrated coupled oscillator systems geared towards a variety of computing tasks.

In [44], [45] neural oscillator systems are shown to perform image segmentation based on degree of correlation. Those works focus primarily on the dynamics of neural oscillator networks and how they can be applied using a software model to accomplish the task of image segmentation. Further work on image segmentation using coupled oscillators appears in [46]. That work explored additional coupling models as well as additional modes of operation (frequency vs phase locking) and coupling (fixed nearest-neighbor versus all-to-all). Work in [47] has shown a system which utilizes both coupled oscillators as well as cellular neural networks (CNNs) to demonstrate contrast enhancement of images. In that work the CNNs and oscillator networks work together to perform the task of contrast enhancement in support of high quality image segmentation. In addition to segmentation, other forms of image processing have been researched using coupled oscillators. [33] further demonstrates the use of coupled oscillators to perform additional image processing tasks, namely edge detection and visual saliency. Using coupled Kuramoto oscillators [48], [49], [50], that work shows how the locking and synchronization time of a system of oscillators can be used to strongly identify edge pixels in an image and fuzzy regions of stark contrast within an image, which can correspond to visually salient regions.

All of the aforementioned approaches presume an array of many oscillators connected in a fully-connected 2-D network, and the stabilization is slowed by the higher order of dependency among the oscillators. In contrast to the network topology used in previous works, this work focuses on a modular 9-synchronized oscillator unit in which the oscillators are coupled through a common center node and stabilization is faster than in a fully-connected 2-D network.

**Table 2**

<table>
<thead>
<tr>
<th>Specification</th>
<th>Oscillator-based</th>
<th>CMOS-based</th>
</tr>
</thead>
<tbody>
<tr>
<td>Channel Length</td>
<td>40 nm</td>
<td></td>
</tr>
<tr>
<td>Power</td>
<td>484 $\mu$W</td>
<td></td>
</tr>
<tr>
<td>Max Throughput</td>
<td>928 M Op/s</td>
<td></td>
</tr>
<tr>
<td>Max Frame Rate for 512 $\times$ 512 pixels</td>
<td>3.5K Frame/s</td>
<td></td>
</tr>
<tr>
<td>Transmission Per Frame for 512 $\times$ 512 pixels</td>
<td>262K bits/Frame</td>
<td></td>
</tr>
<tr>
<td>Max Frame Rate for 4K $\times$ 2K pixels</td>
<td>110 Frame/s</td>
<td></td>
</tr>
<tr>
<td>Transmission Per Frame for 4K $\times$ 2K pixels</td>
<td>8 M bits/Frame</td>
<td></td>
</tr>
</tbody>
</table>

The VO$_2$-based architectures for implementation of the oscillator pairs have been demonstrated with RC time constant model simulations based on real devices fabricated using VO$_2$ [3]. That work relies on arrays of oscillators coupled in pairs and requires a significant portion of CMOS-based circuitry in order to read out the values from the array. In contrast, this work utilizes a different read-out architecture, based on a star-connected topology in which only a single read-out node, rather than each oscillator output node, needs to be monitored, thereby eliminating the complicated read-out circuitry of previous approaches.

Template- and pattern-matching is demonstrated using coupled VCOs in [51]. The work details how the output of such an oscillator array may be used to determine a degree of match between two patches, or vectors, and therefore can be used to detect the template, from a set of templates, which is most similar to a test image. Extending this idea, [52], [53] show how such a correlation engine based on coupled oscillators may be used to implement parts of a much more complex image processing algorithm, HMAX, which is used for feature extraction. The most compute-intensive stages of the HMAX algorithm, Gabor convolution and template correlation, are retargeted to oscillator architectures for processing. In [34], [54], arrays of coupled VCO oscillators are demonstrated in an associative memory architecture for image recall and reconstruction. In those works, the addresses as well as the content of the memory are template images, and indexing is done by finding the template which most closely matches the input. Prior work targets the particular compute-intensive portions of an algorithm with dedicated, fixed-function oscillator accelerators. In contrast, this work explores the possibility of offloading the repetitive computations to a single, multi-function accelerator that is useful across many algorithms.

Those works demonstrate a variety of computational tasks that have been explored using coupled oscillator arrays. However, each of those architectures is geared specifically towards a given task. These rigid systems require multiple oscillator arrays to handle varying computational types as well as input sizes. This work proposes a dynamic architecture which includes dynamic adjustment and modulation of the input using control transistors. This not only allows online retargeting of the array towards different computing tasks, but also supports computations not previously explored, both within and outside of the image processing domain.

9 Conclusion

In this work, we show how HyperFETs, an emerging device based on IMT materials, align with three current power-reducing trends in emerging devices and architectures, namely steep-slope transistors, neuromorphic architectures, and non-Boolean processing paradigms. We describe the utility of HyperFETs as, or as part of, computational primitives in each of the three paths.

We present a case study in utilizing HyperFET-based nano-oscillators for visual computing, and validate a configurable circuit module of synchronized oscillators with multiple image preprocessing functions in addition to basic deviation measurement. Using different configurations, the response of oscillator discharging current to the input voltage is tunable to achieve a broader set of primitives within the oscillator-based processing module. Scaled to a size comparable to current CMOS technology nodes, the proposed 9-oscillator module operates 54X slower, but consumes 227X less power than a CMOS ASIC. The results also show that the $10 \times 10$ array of 9-synchronized oscillator modules is able to provide comparable throughput (928 M Op/s), using only 1/4 of the power (484 $\mu$W) compared to the CMOS-based counterpart.

Acknowledgments

This work was supported in part by the Center for Low Energy Systems Technology (LEAST), sponsored by MARCO and DARPA, by a gift from Intel Corporation, and from the US National Science Foundation under the grants of Expeditions in Computing Award-1317560 and CCF-1317373, and “INSPIRE Track 1: Sensing and Computing with Oscillating Chemical Reactions” DMR-1344178. X. Li and V. Narayanan are the corresponding authors.

References

[1] R. Pandey, H. Madan, H. Liu, V. Chobpatanna, M. Barth, B. Rajamohan, M. Hollander, T. Clark, K. Wang, J.-H. Kim, D. Gundlach, K. Cheung, J. Suehle, R. Engel-Herbert, S. Stemmer, and S. Datta, “Demonstration of p-type In0.7Ga0.3As/GaAs0.35Sb0.65 and n-type GaAs0.4Sb0.65/In0.63Ga0.35As complimentary hetero-junction vertical tunnel FETs for ultra-low power logic,” in Proc. VLSI Technol. Symp., Jun. 2015, pp. T206–T207.


Wei-Yu Tsai received the BS degree in electrical and control engineering from National Chiao Tung University, Taiwan, in 2009 and the MS degree in electrical engineering from National Tsing Hua University, Taiwan, in 2011. He is currently working toward the PhD degree in the Department of Computer Science and Engineering, the Pennsylvania State University. His research interests include neuromorphic computations, oscillator computations, and the other circuit designs in CMOS and emerging devices such as III-V Heterojunction Tunnel FET for high-speed interfaces.

Xueqing Li received the BS and PhD degrees in electronics engineering from Tsinghua University, Beijing, China, in 2007 and 2013, respectively. Since 2013, he has been a postdoctoral researcher in the Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA. His current interests include wideband data converters, RF-powered non-volatile systems, wireless sensor networks, and other circuit designs in CMOS and emerging devices such as III-V Heterojunction Tunnel FET, Negative Capacitance Ferroelectric FET, SymFET, HyperFET, VO2-based oscillators, Single-Electron Transistors (SETs), etc. He is a member of the IEEE.

Matthew Jerry received the BS degree in physics from the University of Delaware in 2013. He is currently working toward the PhD degree in the Department of Electrical Engineering at Pennsylvania State University. His research interests include high-frequency characterization of emerging devices and understanding the scaling and frequency limitations of transition metal oxides.

Baihua Xie received the BS degree in electrical engineering from Nanjing University, China, in 2014. He is a master’s student with the Department of Electrical Engineering at the Pennsylvania State University. His research interest is mainly on hardware implementation of neural networks using emerging devices.

Nikhil Shukla received the BS degree in electronics and telecommunication engineering from the University of Mumbai, Mumbai, India, in 2010. He is currently working toward the PhD degree in electrical engineering in the Department of Electrical Engineering, The Pennsylvania State University, University Park, PA. His research is focused on solid-state devices.

Huichu Liu received the BS degree in microelectronics from Peking University, Beijing, China, in July 2009, and the PhD degree in the Department of Electrical Engineering, Pennsylvania State University, University Park, PA, in May 2015. From 2011 to 2015, she was a research assistant at Pennsylvania State University with the Nano-electronic Devices and Circuits Lab (NDCL) and Microsystem Design Lab (MDL). From May 2011 to August 2011, she was a summer intern at the IBM T. J. Watson Research Center, Yorktown Heights, NY. From June 2014 to August 2014, she was a co-op researcher at Globalfoundries, Santa Clara, CA. She is currently a research scientist at Intel Corporation, Santa Clara, CA. Her research interests include device-circuit interactions using emerging devices for ultra-low-power digital and analog/RF applications. She was one of the winners of the IBM PhD Fellowship Award from 2011 to 2012 Academic Year, and Dr. Nirmal K. Bose Excellence Dissertation Award. She is a member of the IEEE.

Nandhini Chandramoorthy received the BE degree in electronics & communication engineering from Anna University, India. She is currently working toward the PhD degree in the Department of Computer Science and Engineering in The Pennsylvania State University. Her research interests include heterogeneous architecture design for computer vision applications.

Matthew Cotter received the BS and PhD degrees in computer engineering, in December 2008 and May 2015, respectively, both from the Pennsylvania State University. He is currently employed with the Autonomy and Intelligent Sensors Division of the Applied Research Laboratory at the Pennsylvania State University, University Park, PA, 16801. His research interests include image, video, and signal processing designed to target and enable intelligent software and hardware-based systems. He is a member of the IEEE.

Arijit Raychowdhury received the BE degree in electrical and telecommunication engineering from Jadavpur University, India, in 2001 and the PhD degree in electrical and computer engineering from Purdue University in 2007. He is currently an associate professor in the School of Electrical and Computer Engineering at the Georgia Institute of Technology, which he joined in January 2013. His industry experience includes five years as a staff scientist in the Circuits Research Lab, Intel Corporation, and a year as an Analog Circuit Designer with Texas Instruments Inc. His research interests include low-power digital and mixed-signal circuit design, design of power converters and sensors, and exploring interactions of circuits with device technologies. He holds more than 25 US and international patents and has published more than 80 articles in journals and refereed conferences. He serves on the Technical Program Committees of DAC, ICCAD, VLSI Conference, and ISQED and has been a guest associate-editor for The Journal of Emerging Technologies in Computing Systems. He has also taught many short courses and invited tutorials at multiple conferences, workshops, and universities. He received the Intel Labs Technical Contribution Award, 2011; Dimitris N. Chorafas Award for outstanding doctoral research, 2007; the Best Thesis Award, College of Engineering, Purdue University, 2007; Best Paper Awards at the International Symposium on Low-Power Electronic Design (ISLPED) 2012, 2006; IEEE Nanotechnology Conference, 2003; SRC Technical Excellence Award, 2005; Intel Foundation Fellowship, 2006; NASA INAC Fellowship, 2004; M.P. Birla Smarak Kosh (SOUTH POINT) Award for Higher Studies, 2002; and the Meissner Fellowship 2002. He is a senior member of the IEEE.

Donald M. Chiarulli received the MS degree in computer science from the Virginia Polytechnic Institute and the PhD degree also in computer science from Louisiana State University. He currently serves as a professor of electrical and computer engineering at the University of Pittsburgh. He has served on the editorial board of the Journal of Parallel and Distributed Computing and on the National Academy of Sciences Technical Advisory Board for the Army Research Laboratory. He is the author or coauthor of more than 150 technical papers including two that have earned best paper awards at the International Conference on Neural Networks (ICNN) and the Design Automation Conference (DAC). He holds multiple patents for system designs in computer architecture, optoelectronic systems, and signal processing. He is a member of the IEEE.

Steven P. Levitan received the BS degree from Case Western Reserve University in 1972 and the MS and PhD degrees in Electrical Engineering from the University of Massachusetts, Amherst, in 1979 and 1984, respectively. From 1972 to 1977, he worked for Xyologic Systems. He is the John A. Jurekens professor of computer engineering in the Department of Electrical and Computer Engineering where he holds a joint appointment in the Department of Computer Science. He is a past chair of the ACM Special Interest Group on Design Automation (SIGDA). He was a general chair of the 44th ACM/IEEE Design Automation Conference in 2007. He received the Best Paper award at the 1996 IEEE International Conference on Neural Networks and the Best Paper award (Design Methodology) at the 1997 IEEE/ACM Design Automation Conference. His research interests include the design, modeling, simulation, and verification of parallel mixed-signal multilayer systems spanning software, digital and analog electronics, optics, and MEMS. He is a member of ACM and a fellow of the IEEE.

Suman Datta recently joined the University of Notre Dame as the Chang family chair professor of engineering innovation. He was previously a faculty member at Penn State in university science and electrical engineering. He joined Penn State as the inaugural Monkowski associate professor in 2007, and was promoted to full professor in 2011. Before joining Penn State, from 1999 to 2007, he was in the Advanced Transistor Group at Intel Corporation, where he developed several generations of logic transistor technologies including high-k/metal gate, Tri-gate, and alternate channel CMOS transistor technologies. His research interests are in novel solid-state nanoelectronic materials and devices, understanding of transport mechanisms, and ultralow power circuit applications, with recent emphasis on nonvolatile computing powered by energy harvesters and computing using collective state of coupled systems. He received the Intel Achievement Award (2003), the Intel Logic Technology Quality Award (2002), the Penn State Engineering Alumni Association (PSEAS) Outstanding Research Award (2012), the SEMI Award for North America (2012), IEEE Device Research Conference Best Paper Award (2010, 2011), and the PSEAS Premier Research Award (2015). He is a fellow of the IEEE.

John Sampson received the PhD degree in computer science (computer engineering) from the University of California, San Diego, in 2010. He is an assistant professor in the Department of Computer Science and Engineering at Pennsylvania State University. His research interests include energy-efficient computing, architectural adaptations to exploit emerging technologies, and mitigating the impact of dark silicon. He is a member of the IEEE and ACM.

Nagarajan Ranganathan received the BE (honors) degree in electrical and electronic engineering from Regional Engineering College, Tiruchirapalli, University of Madras, India, in 1983 and the PhD degree in computer science from the University of California, Irvine, in 1988. He is currently a professor in the Department of Computer Science and Engineering at the University of South Florida, Tampa. From 1998 to 1999, he was a professor of electrical and computer engineering at the University of Texas at El Paso. His research interests include VLSI system design, design automation, energy and power optimization, biomedical information processing, crisis management and homeland security applications, computer architecture, and parallel computing. He has developed many special purpose VLSI systems for computer vision, image processing, pattern recognition, data compression, and signal processing applications. He has published more than 200 papers in reputed journals and conferences and is a co-owner of five US patents. He served on the editorial boards for the journals: Pattern Recognition (1993-1997), VLSI Design (1994-present), IEEE Transactions on VLSI Systems (1995-1997), IEEE Transactions on Circuits and Systems (1997-1999), and the IEEE Transactions on Circuits and Systems for Video Technology (1997-2000). He was the chair of the IEEE Computer Society Technical Committee on VLSI from 1997 to 2001. He served as the steering committee chair of the IEEE Transactions on VLSI Systems from 2001 to 2002 and as the editor-in-chief from 2003 to 2004. Recently, he received the title of Distinguished Chair Professor of the University of South Florida, Tampa. He is a member of the IEEE Computer Society, IEEE Circuits and Systems Society, and was elected as a fellow of the IEEE in 2002 for his contributions to algorithms and architectures for VLSI Systems.

Vijaykrishnan Narayanan received the BS degree in computer science and engineering from the University of Madras, India, in 1993 and the PhD degree in computer science and engineering from the University of South Florida, Tampa, in 1998. He is a professor of computer science & engineering and electrical engineering at Pennsylvania State University. His research interests include power-aware and reliable systems, design automation, energy and power optimization, biomedical informatics, computer architecture, and parallel computing. He has developed many special purpose VLSI systems for computer vision, image processing, pattern recognition, data compression, and signal processing applications. He has received several awards including the Penn State Engineering Society Outstanding Research Award in 2006, IEEE CAS VLSI Transactions Best Paper Award in 2002, the Penn State CSE Faculty Teaching Award in 2002, the ACM SIGDA outstanding faculty award in 2000, Upsilon Pi Epsilon award for academic excellence in 1997, the IEEE Computer Society Richard E. Merwin Award in 1996, and the University of Madras first rank in Computer Science and Engineering in 1993. He is currently the editor-in-chief of IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems. He has received several certificates of appreciation for outstanding service from ACM and IEEE Computer Society. He is a fellow of the IEEE.
