Recent Synopsy publication "Why 3DICs?"
Helps understand the <1,000x AI computing achievement of Alibaba.
" Bonding stacked dies into a single package instead of a multiple package on a PCB increases I/O density by 100x.The energy-per-bit transfer can be reduced to 30x with the latest technology."
Leveraging Hybrid Bonding as an Alternative to Dimensional Scaling
The technology driver for the next decade is AI. Quoting Applied Materials CEO Gary Dickerson “Are we ready for the biggest opportunity of our lifetime?” - “Gary’s been traveling the world talking to chipmakers and policymakers about a $10 trillion question: how do we capture the economic opportunity of AI, which will transform nearly every industry and institution over the coming years?”
Gary presented the chart below presenting the 1,000x challenge to the semiconductors industry.
Fig. 1: The required improvement in power-performance, to support the demand of AI computing.
In fact the AI challenge is a moving target, as computing demands are growing by 2X about every 3.5 months.
Fig. 2 Uri Frank, Google VP Engineering at ChipEx 2021.
In recent years, there has been a buildup of tension in the US-China relation resulting in the US blocking China from securing access to advanced semiconductor technology and equipment. This includes access to advanced lithography tools such EUV. Accordingly, it was reported that only TSMC, Samsung and Intel have stayed in the race at technology node scaling below 10nm. Hence, it makes sense for Chinese firms to focus alternative resources on mature chip tech, say analysts.
This could explain the adoption of hybrid bonding as a core technology by multiple Chinese corporations. Hybrid bonding allows them to replace dimensional node scaling with system-level 3D scaling.
In August 2018, YMTC officially launched its ground-breaking Xtacking® architecture at the Flash Memory Summit, and won the “Best of Show” award. For its 3D NAND product, it uses two semiconductor manufacturing lines, one for the 3D NAND multi-level memory, and one for the peripheral (memory control) circuits as illustrated in Fig. 3 below.
Fig. 3 Xtacking – using hybrid bonding to stack periphery on top of the 3D NAND memory fabric
In September 2020 another Chinese company, IC League, published the results of their Heterogeneous Integration Technology on Chip (HITOC) Technology, an AI oriented IC development, in a paper titled Breaking the Memory Wall for AI Chip with a New Dimension .
Fig. 4 IC League’s Heterogeneous Integration on Chip (HITOC) Technology
Quoting from the paper: “With HITOC, we have two wafers, logic wafer, and memory wafer, bonded (using Hybrid Bonding) together [Fig. 4]. On the logic wafer, we have pools of processing units. Underneath the logic pool on the other wafer are pools of DRAM arrays.” The results reported by IC League were of better than orders of magnitude of overall improvement as could be seen in the table below.
Fig. 5 HITOC technology the Sunrise device vs. conventional 2D alternative devices
Last week at ISSCC 2022, Alibaba presented a more than 1,000 x improvement for AI computing devices using Hybrid Bonding in a paper titled “184QPS/W 64Mb/mm2 3D Logic-to-DRAM Hybrid Bonding with Process-Near- Memory Engine for Recommendation System.”
The paper rightly points out that for AI computing data transfer dominates the system performance and power consumption. Consequently, overcoming the “Memory Wall” is key for AI computing and even more so with the rapid escalation of the size of the AI model computation requirements as illustrated in Fig. 6 below.
Fig. 6 Escalation of AI module size compared to logic and memory annual technology improvements.
The paper details the device architecture which leverages Hybrid Bonding to connect from the multi-bank DRAM directly to the AI processors’ logic. A die size of DRAM in commodity market is fairly tiny such as smaller than 50 mm2 in part due to higher yield and constraint of JEDEC standard. Interestingly, Alibaba’s logic-to-DRAM 3D chip is truly large chip; 602.22 mm2. By doing so, an important aspect of this work was to architect the logic and the corresponding DRAM as a full system design with multiple DRAM banks directly connecting to the multi cores logic underneath. Then, we can even extend this 3D Logic-to-DRAM concept in full wafer scale chip like Cerebra’s Wafer-Scale-Engine (CS-2). Unfortunately, Cerebras wafer-scale-engine is currently using only SRAM. Imagine what if a fully DRAM wafer would be directly hybrid bonded on Cerebra’s wafer scale engine. The company disclosed that their CS-2 has 40 gigabytes of on-chip SRAM. At the same size, DRAM can easily offer easily greater than 1 terabyte or at least 25 times greater capacity. Now, we are a step closer to breaking memory wall.
Fig. 7 Illustration of system integration flow and the high level architecture
Alibaba’s paper title suggests that the work targets the AI segment of Recommendation Systems, in which Alibaba has a high interest and has been developing systems including publishing work since at least 2017. This paper presents a very important breakthrough in performance and power reduction. Quoting from the paper: “"Compared to the CPU-DRAM system, our chip achieves 9.78× speedup. Note that the throughput and memory capacity can be further improved by scaling up the number of hybrid bonding blocks or using more advanced process technologies to serve more complicated recommendation models. In terms of energy efficiency, which is significant in memory-bound applications, our work achieves 184.11QPS/W (QPS – Queries per Second), which outperforms the CPU-DRAM system by 317.43×. In terms of area efficiency, the high-density hybrid bonding improves QPS/mm2 by 660×." The results were achieved while using a relatively old process node of 55 nm for the logic and were compared with the top of the line Intel Xeon Gold CPU processed at 14 nm.
Fig. 8 Hybrid Bonding performance and energy improvements presented at ISSCC 2022.
These results are multiple orders of magnitude better than the result reported by AMD for their V-Cache using hybrid bonding for added cache memory to their Ryzen CPU. There could be few reasons for the difference including the effort in re-architecting the system to highly leverage the Hybrid Bonding technology. The Alibaba chip was certainly architected from the ground up anticipating hybrid bonding, where the AMD combination may have been an afterthought. Further it should be noted that while AMD reported utilizing a vertical connectivity pitch of 9µm, the Chinese vendors are reporting vertical pitch of 3µ and in some cases even 1µ.
The Sputnik Surprise – On February 7, 1958, the US established DARPA “to keep that technological superiority in the hands of the United State”, this is even more important now as we consider the potential power of AI technologies for the coming years.
"Dr. Zvi Or-Bach presented a talk for the SSCS Romania Chapter on March 24th, 2021."
"A vision for: New Types of Programmable Fabric 3d FPGA"
"In this book, a global team of experts from academia, research institutes and industry presents their vision on how new nano-chip architectures will enable the performance and energy efficiency needed for AI-driven advancements in autonomous mobility, healthcare, and man-machine cooperation." (Springer Link)
Below are presented sections from chapters written by us, from the book. To get the full chapters, access the following link NANO-CHIPS 2030 or alternatives such as Amazon.
"The Continuing Evolution of Moore's Law" blog written by Dr. Michael Mayberry, the chief technology officer of Intel Corporation, was posted on August 2nd on EE Times website. He wrote that in the future the logic will move more toward 3D.
"We moved to 3D starting with Trigate (FinFET) at 22 nm node, but an even better example is our announcement in May of a 96-layer, 4-bit-per-cell NAND flash that packs up to 1 terabit of information per die. This is a true post-Dennard example of packing increasing functions into a die without feature scaling. Over time, we expect logic to also move more toward 3D." [read full article]
EV Group (EVG), a supplier of wafer bonding and lithography equipment for the MEMS, nanotechnology and semiconductor markets, today unveiled the new SmartView® NT3 aligner, which is available on the company’s industry benchmark GEMINI® FB XT integrated fusion bonding system for high-volume manufacturing (HVM) applications. Developed specifically for fusion and hybrid wafer bonding, the SmartView NT3 aligner provides sub-50-nm wafer-to-wafer alignment accuracy — a 2-3X improvement — as well as significantly higher throughput (up to 20 wafers per hour) compared to the previous-generation platform. - Original article from Solid State Technology Magazine.
For over 4 decades the gap between computer processing speed and memory access has grown at about 50% per year, to more than 1,000x today. This provides an excellent opportunity to enhance the single-core system performance. An innovative 3D integration technology combined with re-architecting the integrated memory device is proposed to bridge the gap and enable a 1,000 x improvement in computer systems. The proposed technology utilizes processes that are widely available and could be integrated in products within a very short time.
Keywords—processor-memory gap; 3D memory;
The Fig. 1 illustrates the growing gap between processing and memory access [1,2]
The source for this gap is directly related to the gap between transistor performance progress vs. on-chip interconnect delay as illustrated in Fig. 2 [3,4]
In most computer systems the processor is being bought from a processor vendor (Intel, AMD, Nvidia,…) while the memory is being bought from memory vendors (Micron, Samsung, Hynix,…). In many cases multiple memory devices are being integrated into a memory module, often called DIMM (such as Fig. 3), which is then integrated into a computer system with the processor(s). This integration is driven by the very different semiconductor technology knowhow, and manufacturing infrastructure required, for processors vs. memories.
Printed circuit board (‘PCB’) is used to connect processor to memory adding significantly to the ‘gap,’ as illustrated in Fig. 4 
This gap has been articulated in many forms including in a presentation titled “Why we need Exascale and why we won't get there by 2020”  which included Figs. 5. There is a tremendous power cost in moving data to and from an off-chip memory, and even an ‘integrated’ memory.
2. PREVIOUSLY PROPOSED SOLUTIONS
A joint research by teams from Stanford University, Berkeley and Carnegie Mellon was summarized in their paper “Energy-Efficient Abundant-Data Computing: The N3XT 1,000×” illustrated in Fig. 6 .
Yet it is very unclear if and when any of the proposed new technologies – CNT, RRAM, STT-MRAM - would be available and in high volume production and ready to be used for manufacturing of computer systems.
3D integration using TSV has been considered as another attractive option to bridge the gap. A paper by D. H. Woo concluded “On average, for single-threaded memory intensive applications, the speedups range from 1.53 to 2.14 compared to a conventional 2D architecture” . This is far less than the monolithic 3D work done by Stanford. In Fig. 12 Micron provides a chart comparing TSV vs. DDR3. Micron has formed an industry consortium named Hybrid Memory Cube – “HMC,” to leverage TSV for bridging the memory gap.
A table comparing 2.5D and TSV memory stacking technologies is presented in Fig. 7.
While the various approaches offer clear performance improvements, they are far from what is suggested by the N3XT factor of 1,000x. It would seem that the inherent limitations (density, delay) of the TSV technology are the limiting factors. We therefore propose a novel 3D integration that offers more than 1,000x better vertical connectivity to enable comprehensive bridging of the processor memory gap, while still utilizing widely available commercial processes that make it attractive for fast industry adoption.
3. BRIDGING THE GAP WITH MONOLITHIC 3D INTEGRATION
In TSV technology the layer thickness of each layer in the stack is tens of microns (~50 μm) and the via through each stacked layer is about 5 microns in diameter. Monolithic 3D integration enables stacked layers of tens of nanometers (~50 nm) with vias that are of similar size of regular vias between metal layers. All currently known 3D integration techniques that provide such vertical connectivity require major process and equipment changes at the wafer fab. The following approach is a monolithic 3D innovation that overcomes this challenge.
Or-Bach in “Modified ELTRAN® - A Game Changer for Monolithic 3D” presented a 3D integration that does not require process change but would require a special porous base substrate . Here we propose an alternative substrate could be far more readily available (Fig. 8).
Epitaxial wafer with silicon over SiGe are already the preferred substrates of future nodes for silicon nano-wires, with gate-all-around in which the SiGe layer is used as a sacrificial layer. We suggest a reverse use of such epitaxial layer, in which the SiGe would function as an etch-selective ‘cut’ layer by functioning as an etch stop for a back grinding and etch-back process sequence. The 3D technique of flip, bond and etch-back of an SOI donor had been practiced by MIT Lincoln Lab for many years [10, 11]. Use of a SiGe ‘cutable’ substrate is an attractive alternative as SOI wafers are quite expensive. The SiGe ‘cutable’ substrate could be processed as a regular wafer through the fab all through the process, including BEOL interconnection layers. Then it could be flipped over and bonded to a target wafer. A simple grind and etch-back operation using the SiGe layer an etch stop follows, which could be later removed by a follow-up etch step. Accordingly, the transferred layer could be made as thin as desired, down to less than 100 nm of silicon, removing by grinding and etch back almost all of the 700 micron of the original substrate. Fig. 16 illustrates selective etching of silicon allowing SiGe to serve as an etch stop [12, 13, 14].
Currently available production worthy wafer bonders can support such layer transfers with less than 200nm (3σ) misalignment .
This fine grained 3D integration could be used for reengineering memory product into two strata, one for arrays of bit cells and one for memory control circuitry, as is illustrated in Fig. 10.
Such densely connected 3D partitioning would reduce memory costs as bit-cell processing is completely disconnected and different from memory control circuitry processing. Manufacturing memory and logic on separate wafers would reduce overall costs, while enabling a paradigm shift in the logic and memory interface.
Having the memory peripheral circuitry not in the periphery of the device but rather on top of it, allows cost effective breaking up of the memory array into an array of small memory units, such as 200 μm x 200 μm units, each with its own word-lines and bit-lines.
Additionally, multiple memory strata could be vertically integrated to offer much larger memory capacity for the same footprint. Fig. 11 illustrates a region of such memory strata having two adjacent memory units with word-lines or bit-lines traveling in-between, and a per-layer selector, and nano-pads for vertical connectivity.
Fig. 12a illustrates a vertical connectivity region having landing pads large enough to cover the bonding misalignment. Fig. 12b illustrates the region with overlaying via connecting these pads to the word-lines or the bits-lines, and Fig. 12c illustrates the overlying pinning pads.
Memory strata could be constructed by integrating such connectivity structure with the block diagram of Fig. 11 to form a stackable memory structure, which could be produced as a generic memory substrate. Hybrid wafer bonding could be used to provide vertical connectivity between layers in the stack. The required distance between two adjacent units (Fig. 11) could be less than 1 μm, resulting in less than 0.5% overhead for this 3D memory tiling and connectivity structure.
Fig. 13 illustrates a vertical cross-section view of 4 strata distributing the top select signal to each stratum in the stack.
This connectivity structure opens up many usage options, including redundancy to overcome defects. The memory is constructed by stacks of strata, each constructed as array of units connected in parallel with a select signal per stratum. One of these strata could serve as a redundancy stratum with per unit select allowing repair at the unit level. In addition, multiple memory access options could be enabled from high speed local access to a global—albeit somewhat slower—access to large array of units. An additional advantage of this architecture is having one (or two) memory control strata to service multiple memory strata.
Fig. 14 illustrates a 3D computer system utilizing the technologies presented here. The base silicon is a carrier substrate which also provides cooling to the main multi-core computing stratum. Through the first thermal isolation layer, the computer stratum is connected to the multi-unit lower memory control stratum, which controls the multi-unit memory array strata. Overlaying the memory strata is an upper memory control stratum which provides a second access to the same memory strata. Through a second thermal isolation layer a second computing stratum could be connected to the upper memory control stratum. The second computing stratum could be the Input/Output computing stratum communicating with external devices utilizing another communications stratum. The communications stratum could utilize wired, wireless, optical or other channels to communicate with external devices. An upper heat removal apparatus could overlay the communications stratum.
We have presented a 3D integration technology combined with memory architecture that could utilize existing processes and infrastructure to bridge the processor memory gap. Using these concepts would enable the current connectivity of about 100 wires at average 20 mm length to be replaced with 100,000 wires with average 20 μm length, with the corresponding 1,000x improvements in computation speed, power, and cost. Additional advantages would be reduction of overall system costs, establishing a generic memory fabric with build in full repair capability at factory and field.
 Hennessy, John L. and David A. Patterson. Computer Architecture: A Quantitative Approach. 4th ed., p. 289. Elsevier, 2007.
 McCalpin, John D. “Memory Bandwidth and System Balance in HPC Systems”, Invited Talk, International Conference for High Performance Computing, Networking, Storage, and Analysis, 2016.
 Wu, Banqiu, and Ajay Kumar. "Extreme ultraviolet lithography and three dimensional integrated circuit—A review." Applied Physics Reviews 1.1, 2014.
 Yeap, Geoffrey. "Smart mobile SoCs driving the semiconductor industry: Technology trend, challenges and opportunities." In Electron Devices Meeting (IEDM), 2013.
 Sun, Jack Y-C. "System scaling and collaborative open innovation." VLSI Technology (VLSIT), 2013 Symposium on. IEEE, 2013.
 Simon, Horst. "Why we need Exascale and why we won’t get there by 2020." Optical Interconnects Conference, Santa Fe, New Mexico. 2013.
 Aly, Mohamed M. Sabry, et al. "Energy-efficient abundant-data computing: The n3xt 1,000 x." Computer 48.12 (2015): 24-33.
 Woo, Dong Hyuk, et al. "An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth." High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on. IEEE, 2010.
 Or-Bach, Zvi, et al. "Modified ELTRAN®—A game changer for Monolithic 3D." SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), 2015 IEEE. IEEE, 2015.
 Chen, C. K., et al. "3D-enabled heterogeneous integrated circuits." SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), 2013 IEEE. IEEE, 2013.
 Chen, C. K., et al. "SOI-enabled three-dimensional integrated-circuit technology." SOI Conference (SOI), 2010 IEEE International. IEEE, 2010.
 Orlowski, Marius, et al. "(Invited) Si, SiGe, Ge, and III-V Semiconductor Nanomembranes and Nanowires Enabled by SiGe Epitaxy." ECS Transactions 33.6 (2010): 777-789.
 Borenstein, J. T., et al. "Silicon germanium epitaxy: a new material for MEMS." MRS Proceedings. Vol. 657. Cambridge University Press, 2000.
 Taraschi, Gianni, et al. "Ultrathin strained Si-on-insulator and SiGe-on-insulator created using low temperature wafer bonding and metastable stop layers." Journal of The Electrochemical Society 151.1 (2004): G47-G56.
 Uhrmann, T. "Monolithic IC Integration-Key alignment specifications for high process yield." SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S), IEEE. 2014.
We have a guest contribution from Zvi Or-Bach, the President and CEO of MonolithIC 3D Inc.
Next week, as part of the IEEE S3S 2017 program, we will present a paper (18.3) titled “A 1,000x Improvement in Computer Systems by Bridging the Processor Memory Gap”.
Next week, as part of the IEEE S3S 2017 program, we will present a paper (18.3) titled “A 1,000x Improvement in Computer Systems by Bridging the Processor Memory Gap”. The paper details a monolithic 3D technology that is low-cost and ready to be rapidly deployed using the current transistor processes. In that talk, we will also describe how such an integration technology could be used to improve performance and reduce power and cost of most computer systems, suggestive of a 1,000x total system benefit. This game changing technology would be presented also in the CoolCube open workshop, a free satellite event of the conference 3DI program.
In an interesting coincidence DARPA just came out with a calls for >50x improvement in SoC
The 3DSoC DARPA solicitation reads: “As noted above, the 3DSoC technology demonstrated at the end of the program (3.5 Years) should also have the following characteristics:
–Capability of > 50X the performance at power when compared with 7nm 2D CMOS technology.
The 3DSoC program goal of 50x is to allow proposals suggesting US-built device at 90nm node vs. 7nm of computer chip using conventional 2D technologies. Looking at the table below we can see that if 7nm technology is used the benefit would be over 300x
This represents a paradigm shift for the computer industry and high-tech world, as normal scaling would provide 3x improvement at best. The emergence of AI and deep learning system makes memory access a key challenge for future systems, and indicate the far larger benefits offered by monolithic 3D integration.
The following charts were presented by the 3DSoC program manager Linton Salmon at the 3DSoC proposers day. The program calls for the use of monolithic 3D to overcome the current weakest link in computers – the memory wall.
Leading to the 3DSoC solicitation was work done by Stanford, MIT, Berkeley and Carnegie Mellon
Proposals are due by Nov 6.
There is a unique opportunity to hear the 3DSoC DARPA Program Manager, Dr. Linton Salmon, articulate the program and what DARPA is looking for during his invited talk at the S3S 2017 conference next week.
We have a guest contribution from Zvi Or-Bach, the President and CEO of MonolithIC 3D Inc.
As recently reported, in an effort to initiate resurgence of the U.S. electronics industry, some $500-$800 million will be invested in post-Moore's Law technologies.
Quoting from the BAA: “As noted above, the 3DSoC technology demonstrated at the end of the program (3.5 Years) should also have the following characteristics:
The following charts were presented by the 3DSoC program manager Linton Salmon at the 3DSOC proposal day. The program calls for the use of monolithic 3D to overcome the current weakest link in computers – the memory wall. This is illustrated in the following slides:
Leading to the 3DSoC solicitation was work done by Stanford, MIT, Berkeley and Carnegie Melon, summarrized in the following two slides.
Proposals are due by Nov 6. Everyone has a unique opportunity to hear an invited talk about the program on October 16 from the 3DSoC DARPA Program Manager, Dr. Linton Salmon during this year’s IEEE S3S conference at the Hyatt Regency at the San Francisco Airport. The IEEE S3S conference has dedicated focus on monolithic 3D technologies.
3D integration is considered expensive and monolithic 3D is considered extremely challenging requiring high technology, investment, and major changes to the foundry process. In an interesting entanglement, the S3S 2017 program includes a paper by us, MonolithIC 3D Inc., in which we will present a monolithic 3D technology that is low cost and ready to be rapidly deployed using the current transistor processes. In that talk we will also describe how such an integration technology could be used to improve performance and reduce power and cost of most computer systems, suggestive of a 1,000x total system benefit. I hope to see you there to talk further about this upcoming disruptive change.
Meet the Bloggers