DATE 2025 Detailed Programme

The detailed programme of DATE 2025 will be updated continuously.

More information is available on the ASD Initiative, Keynotes, Tutorials, Workshops, and Young People Programme.

Navigate to Monday, 31 March 2025 | Tuesday, 01 April 2025 | Wednesday, 02 April 2025.


Monday, 31 March 2025

OC Opening Ceremony

Date: Monday, 31 March 2025
Time: 08:30 CEST - 09:00 CEST


OK01 Opening Keynote 1

Date: Monday, 31 March 2025
Time: 09:00 CEST - 09:45 CEST

Time | Label | Presentation Title | Authors
09:00 CEST OK01.1 TOWARDS GREENER ELECTRONICS AND A 1000X GAIN IN ENERGY EFFICIENCY: CO-OPTIMIZING INNOVATIVE IC ARCHITECTURES, DISRUPTIVE CMOS TECHNOLOGIES AND NEW EDA TOOLS
Presenter:
Jean-René Lèquepeys, CEA-Leti, FR
Author:
Jean-René Lèquepeys, CEA-Leti, FR
Abstract
Semiconductors and chips are ever-present in our current digital world. From smart sensors and the industrial Internet of Things to Digital Cities, personalized Medicine, Precision Agriculture, Vehicle Automation, and Cloud & High Performance Computing, semiconductor applications cover a very wide spectrum of society's needs. However, global warming is highlighting the social and environmental impact of the digital transition, and the complex trade-offs and choices that lie ahead if we are to build a sustainable world. How do we pursue digitalization within a limited power budget and planetary limits? How do we make greener choices in the face of ever-increasing and aggressive competition? How do we choose the right digital performance for each application instead of a one-size-fits-all, best-performance-for-all approach? The semiconductor ecosystem is indeed facing a difficult dilemma with complex key trade-offs. With these stakes clearly in mind, the semiconductor community is performing disruptive research to provide greener electronics, able to attain very large gains in energy efficiency and just the right performance for each application. With the help of AI-boosted design methodologies and CAD tools, we have set out to co-optimize innovative CMOS technologies, disruptive chip architectures, and computing models with new algorithms for embedded software. This keynote will provide an overview of the global semiconductor landscape and the challenge of mastering the data deluge for the entire semiconductor ecosystem. In order to face this challenge, we must all work together to reduce the collection, transport and storage of fruitless data. The keynote will also describe recent results from CEA-Leti and CEA-List's research on sustainable and greener technologies. To conclude, I will present an overview of the European Chips Act initiative, with the launch of the pilot lines, the Design Platforms and Competence Centers, a pan-European program that will be driving key milestones in the next five years to accelerate the accomplishment of our common goal of a sustainable and sovereign digital Europe.

OK02 Opening Keynote 2

Date: Monday, 31 March 2025
Time: 09:45 CEST - 10:30 CEST

Time | Label | Presentation Title | Authors
09:45 CEST OK02.1 A VISION OF SYSTEMS AND TECHNOLOGY IN A CONNECTED EUROPE
Presenter:
Giovanni De Micheli, EPFL, CH
Author:
Giovanni De Micheli, EPFL, CH
Abstract
The unprecedented growth of electronic system applications, from AI to smart products, creates both a huge market opportunity and a deep need for talented engineers. Europe will play a dominant role in the 2030s if we (i.e., our community) can set up the premises for such a technology expansion now. Whereas the European Chips Act is an important enabler, finance represents only one of the necessary conditions for success. The key aspect is the ability to leverage diverse competences and connect the partially-untapped energies of the various European players, ranging from Industry to Academia. Europe's strength stems from diversity and the ability to design complex systems from parts, possibly coming from various sources. The 'value added' comes from the engineers who can create functionality and services, and who can adapt them to a diverse market of consumers. Yet I argue that this precious resource, the human capital represented by engineers and technologists, is too scarce and its limitation in size is a main handicap for creating a strong market of intelligent products and services. Education of engineers has to evolve and concentrate on the broader issue of system problem solving based on a deep understanding of technology. Industry has to join forces with academia by sharing knowledge and objectives and by creating a strong enthusiasm for engineering.

ASD01 ASD technical session: Enhancing Dependability and Efficiency in Automotive and Autonomous Systems

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Session chair:
Selma Saidi, TU Braunschweig, DE

Session co-chair:
Dirk Ziegenbein, Robert Bosch GmbH, DE

This session explores advancements in automotive and autonomous systems, focusing on achieving predictability, reliability, and efficiency. The session begins with a proposal on extending the AUTOSAR Adaptive standard using the System-Level Logical Execution Time (SL-LET) paradigm to ensure determinism, critical for the predictability of modern automotive systems. The second presentation demonstrates noise perturbation attacks on image segmentation, a core perception component of safety-critical autonomous systems, and how they can be predicted and mitigated. Finally, a framework designed to optimize the efficiency of 3D object detection in autonomous vehicles through pattern pruning and quantization is presented, significantly enhancing real-time performance and energy efficiency on resource-limited platforms.

Time | Label | Presentation Title | Authors
11:00 CEST ASD01.1 MODELING THE SL-LET PARADIGM IN AUTOSAR ADAPTIVE
Speaker:
Davide Bellassai, Scuola Superiore Sant'Anna, IT
Authors:
Davide Bellassai1, Gerlando Sciangula2, Claudio Scordino3, Daniel Casini4 and Alessandro Biondi4
1Evidence S.r.l., Scuola Superiore Sant'Anna, IT; 2Huawei and Scuola Superiore Sant'Anna, IT; 3Huawei Inc, IT; 4Scuola Superiore Sant'Anna, IT
Abstract
The AUTOSAR consortium has proposed the AUTOSAR Adaptive standard to tackle the challenges introduced by the design of modern automotive functionality. It consists of a service-oriented architecture (SoA) implemented in C++ and built on top of POSIX operating systems. However, unlike the previous AUTOSAR Classic specifications, this novel standard does not address non-functional requirements, including determinism, which is of key importance to guarantee the system's functional safety. This paper proposes extensions to the AUTOSAR Adaptive standard to achieve determinism by leveraging the System-Level Logical Execution Time (SL-LET) paradigm, which is already used in the context of AUTOSAR Classic but needs to be revisited to be employed in Adaptive. We evaluate the feasibility of the proposed model extension on the AUTOSAR Adaptive Platform Demonstrator (APD), which provides an implementation of AUTOSAR Adaptive specifications using a realistic automotive application.
11:30 CEST ASD01.2 GENERATING AND PREDICTING OUTPUT PERTURBATIONS IN IMAGE SEGMENTERS
Speaker:
Bryan Donyanavard, San Diego State University, US
Authors:
Matthew Bozoukov1, Nguyen Anh Vu Doan2 and Bryan Donyanavard3
1Miramar College, US; 2Infineon Technologies AG & TU Munich, DE; 3San Diego State University, US
Abstract
Image segmentation applications are a core component of safety-critical autonomous software pipelines. Sensor data input noise can lead to segmentation output corruption that threatens safety in both DNN- and transformer-based segmenters. Previous work has proposed methods for generating malicious noise to cause DNN- and transformer-based object detection and classification output corruption. We perform the same task for image segmentation applications using genetic algorithms for optimization. We then propose a novel method to predict whether an input image will yield a corrupted segmentation output due to noise. We evaluate the optimal noise generation and corruption prediction on state-of-the-art image segmenters YOLOv8 and DETR. We observe that we can (a) cause segmentation output corruption with noise that is undetectable to the human eye and unrelated to the corrupted region of the image; and (b) predict output corruption due to image noise with over 96% accuracy.
12:00 CEST ASD01.3 UPAQ: A FRAMEWORK FOR REAL-TIME AND ENERGY-EFFICIENT 3D OBJECT DETECTION IN AUTONOMOUS VEHICLES
Speaker:
Abhishek Balasubramaniam, Colorado State University, US
Authors:
Abhishek Balasubramaniam1, Febin Sunny2 and Sudeep Pasricha3
1Colorado State University, US; 2AMD, US; 3Colorado State University, US
Abstract
To enhance perception in autonomous vehicles (AVs), recent efforts are concentrating on 3D object detectors, which deliver more comprehensive predictions than traditional 2D object detectors, at the cost of increased memory footprint and computational resource usage. We present a novel framework called UPAQ, which leverages semi-structured pattern pruning and quantization to improve the efficiency of LiDAR point-cloud and camera-based 3D object detectors on resource-constrained embedded AV platforms. Experimental results on the Jetson Orin Nano embedded platform indicate that UPAQ achieves up to 5.62× and 5.13× model compression rates, up to 1.97× and 1.86× boost in inference speed, and up to 2.07× and 1.87× reduction in energy consumption compared to state-of-the-art model compression frameworks, on the Pointpillar and SMOKE models respectively.

BPA01 BPA Session 1

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Time | Label | Presentation Title | Authors
11:00 CEST BPA01.1 QUANTIFYING TRADE-OFFS IN POWER, PERFORMANCE, AREA, AND TOTAL CARBON FOOTPRINT OF FUTURE THREE-DIMENSIONAL INTEGRATED COMPUTING SYSTEMS
Speaker:
Danielle Grey-Stewart, Harvard University, US
Authors:
Danielle Grey-Stewart, Mariam Elgamal, David Kong, Georgios Kyriazidis, Jalil Morris and Gage Hills, Harvard University, US
Abstract
To address computing's carbon footprint challenge, designers of computing systems are beginning to consider carbon footprint as a first-class figure of merit, alongside conventional metrics such as power, performance, and area. To account for total carbon (tC) footprint of a computing system, carbon footprint models must consider both embodied carbon (Cembodied) due to emissions during manufacturing, and operational carbon (Coperational) from day-to-day use. Models for Coperational are relatively mature due to the direct relationship between Coperational and energy consumed while computing. In contrast, models for Cembodied primarily focus on today's silicon-based technologies, not capturing the wide range of beyond-Si technologies that are actively being developed for future computing systems, including emerging nanomaterials, emerging memory devices, and various three-dimensional (3D) integration techniques. Cembodied models for emerging technologies are essential for accurately predicting which technology directions to pursue without exacerbating computing's carbon footprint. In this paper, we (1) develop Cembodied models for 3D-integrated computing systems that leverage emerging nanotechnologies. We analyze an example fabrication process that is highly promising for energy-efficient computing: 3D integration of carbon nanotube field-effect transistors (CNFETs) and indium gallium zinc oxide (IGZO) FETs fabricated directly on top of Si CMOS at a 7 nm technology node. We show that Cembodied of this process is, on average (considering various energy grids), 1.31× higher per wafer vs. a baseline 7 nm node Si CMOS process. (2) As a case study, we quantify trade-offs in power, performance, area, and tC footprint for an embedded system comprising an ARM Cortex-M0 processor and embedded DRAM, implemented in each of the above processes. For a representative lifetime of the system (running applications from the Embench suite for 2 hours per day over 24 months, with a clock frequency of 500 MHz), we show that the 3D IGZO/CNFET/Si implementation is 1.02× more carbon-efficient per good die (considering yield) vs. the baseline Si implementation, quantified by the product of tC and application execution time (tCDP, an effective metric of carbon efficiency). (3) Finally, we show techniques to quantify carbon efficiency benefits of future computing systems, even when there is uncertainty in carbon footprint models. Specifically, we show how to robustly compare tCDP for multiple computing systems, given underlying uncertainty in Cembodied, computing system lifetime, carbon intensity (in equivalent grams of CO2 emissions per unit energy consumption), and yield.
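For reference, the two metrics named in this abstract can be written compactly; the relations below only restate the definitions given above (total carbon as the sum of embodied and operational carbon, and the carbon-delay product as the carbon-efficiency figure of merit) and are not additional results from the paper:

  tC = Cembodied + Coperational        (total carbon: manufacturing plus use-phase emissions)
  tCDP = tC × t_execution              (carbon-delay product; lower means more carbon-efficient)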
11:20 CEST BPA01.2 COMPUTE-IN-MEMORY ARRAY DESIGN USING STACKED HYBRID IGZO/SI EDRAM CELLS
Speaker:
Munhyeon Kim, Seoul National University, KR
Authors:
Munhyeon Kim1, Yulhwa Kim2 and Jae-Joon Kim1
1Seoul National University, KR; 2Sungkyunkwan University, KR
Abstract
To effectively accelerate neural networks in compute-in-memory (CIM) systems, higher memory cell density is critical for managing increasing computational workloads and parameters. While CMOS-based embedded dynamic random access memory (eDRAM) is being explored as an alternative, addressing the short retention time (tret) (<1 ms) remains a challenge for system applications. Recent studies highlight that InGaZnO (IGZO)-based eDRAM achieves a significantly longer retention time (>100 s), but additional improvements are needed due to considerable cell variability and slower operating speeds compared to CMOS-based cells. This paper proposes a 3T-based stacked hybrid IGZO/Si eDRAM (Hybrid-3T) cell and array design for CIM systems, alongside a system-level evaluation for deep neural network (DNN) workloads. The Hybrid-3T cell, built on 7-nm FinFET technology, extends the retention time by 100 s compared to IGZO-based 3T eDRAM (IGZO-3T). It also provides 3.4× higher bit cell density compared to 8T SRAM cells and 2× higher density than CMOS-based 3T eDRAM (CMOS-3T), while maintaining similar throughput and variability levels as eDRAM and SRAM systems. Additionally, DNN inference accuracy for vision and natural language processing (NLP) tasks is evaluated using the proposed CIM design, considering the impact of enhanced cell variability and retention time on system-level performance. The retention time required for CIM operation accuracy (tret,CIM) is more than 10^7 times longer in Hybrid-3T than in CMOS-3T, and the retention time accounting for variability (tret,CIM v) is over 3× longer than IGZO-3T eDRAM. Consequently, the proposed Hybrid-3T eDRAM CIM integrates the strengths of both CMOS-3T and IGZO-3T CIM designs, enabling high-performance, reliable systems.
11:40 CEST BPA01.3 TIMING-DRIVEN GLOBAL PLACEMENT BY EFFICIENT CRITICAL PATH EXTRACTION
Speaker:
Yunqi Shi, Nanjing University, CN
Authors:
Yunqi Shi1, Siyuan Xu2, Shixiong Kai2, Xi Lin1, Ke Xue1, Mingxuan Yuan3 and Chao Qian1
1Nanjing University, CN; 2Huawei Noah's Ark Lab, CN; 3Huawei Noah's Ark Lab, HK
Abstract
Timing optimization during the global placement of integrated circuits has been a significant focus for decades, yet it remains a complex, unresolved issue. Recent analytical methods typically use pin-level timing information to adjust net weights, which is fast and simple but neglects the path-based nature of the timing graph. The existing path-based methods, however, cannot balance the accuracy and efficiency due to the exponential growth of number of critical paths. In this work, we propose a GPU-accelerated timing-driven global placement framework, integrating accurate path-level information into the efficient DREAMPlace infrastructure. It optimizes the fine-grained pin-to-pin attraction objective and is facilitated by efficient critical path extraction. We also design a quadratic distance loss function specifically to align with the RC timing model. Experimental results demonstrate that our method significantly outperforms the current leading timing-driven placers, achieving an average improvement of 40.5% in total negative slack (TNS) and 8.3% in worst negative slack (WNS), as well as an improvement in half-perimeter wirelength (HPWL).

CFP Panel on Career Perspectives

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:00 CEST


ET01 Agile Hardware Specialization: A toolbox for Agile Chip Front-end Design

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Compared to software design, hardware design is more expensive and time-consuming. This is partly because the software community has developed a rich set of modern tools that help software programmers get projects started and iterated easily and quickly. For hardware design, by contrast, the tools are seriously antiquated and lacking. Modern digital chips are still designed manually using hardware description languages such as Verilog or VHDL, which requires low-level and tedious programming, debugging, and tuning. In this tutorial, we will introduce Agile Hardware Specialization (AHS), a toolbox for agile chip front-end design.

The tutorial will highlight the methodology and open-source tools in AHS for both chip design and verification. From the design perspective, AHS offers multiple approaches that use different programming interfaces and target different scenarios, including:

  1. a multi-level hardware intermediate representation-based high-level synthesis flow, which uses C and C++ as the programming languages. This flow also supports domain-specific languages and optimizations for specific domains such as tensor algebra. We also design an efficient cross-level debugger for high-level synthesis that enables breakpoints and stepping at different hardware intermediate representations.
  2. an embedded hardware description language, which uses Rust as the programming language. This flow includes a general HDL and provides deterministic timing support and procedural control logic specification.

These different methodologies exhibit different trade-offs in productivity and PPA (performance, power, and area) for chip design. From the verification perspective, we will present agile simulation and debugging tools, which can check the functional and performance behaviors of the hardware. The attendees will learn the methodology, design automation fundamentals, and software tools of AHS.

Speakers

Dr. Yun Liang, Professor, Peking University, China

Xiaochen Hao, Ph.D. Candidate, Peking University, China

Target Audience

We invite DATE 2025 participants with a keen interest in chip design and verification and computer-aided design (CAD) tools. Please join us!

Learning objectives

  • An introduction to the AHS framework.
  • Details of the AHS tools including Hector, Hestia, Cement, Khronos, etc.
  • Hands-on experimentation using AHS tools.
  • Motivation for future research within the AHS framework.

Required Background

  • Basic knowledge of a programming language such as C or C++
  • A keen interest in learning hardware specialization and Electronic Design Automation (EDA)
  • Desirable: prior knowledge of high-level synthesis and associated toolchains.

Detailed Program

  • Part 1: Lecture (1 hour)
  • Part 2: Hands-on session (30 mins)

Lab installation instructions and handouts are available at: https://ericlyun.me/tutorial-date2025


FS01 Focus session - Specifications Mining in a World of Generative AI: Extensions, Applications, and Pitfalls

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Session chair:
Graziano Pravadelli, Università di Verona, IT

Session co-chair:
Badri Gopalan, Synopsys, US

Organisers:
Graziano Pravadelli, Università di Verona, IT
Samuele Germiniani, Marconi University of Rome, IT

The session consists of three technical contributions (15 mins each) and one panel (45 mins), totaling 90 minutes, focused on R&D challenges, emerging trends, and solutions for the automatic generation of formal specifications in system-level assertion-based verification (ABV). The first part of the session will explore the role of LLMs in assertion generation, delve into the automatic mining of assertions for security verification, and present a framework for the fair qualification and evaluation of current and future assertion miners. Then, in the second part, the panel will highlight unmet needs in specification mining, motivating researchers to develop new approaches and tools that move beyond academic proofs of concept and position automatic assertion generation as a practical, industry-ready solution for ABV.

Time | Label | Presentation Title | Authors
11:00 CEST FS01.1 ARE LLMS READY FOR PRACTICAL ADOPTION FOR ASSERTION GENERATION?
Speaker:
Debjit Pal, University of Illinois Chicago, US
Authors:
Vaishnavi Pulavarthi1, Deeksha Nandal2 and Debjit Pal2
1UIC, US; 2University of Illinois at Chicago, US
Abstract
Assertions have been the de facto collateral for simulation-based and formal verification of hardware designs for over a decade. The quality of hardware verification, i.e., detection and diagnosis of corner-case design bugs, is critically dependent on the quality of the assertions. With the onset of generative AI such as Transformers and Large Language Models (LLMs), there has been a renewed interest in developing novel, effective, and scalable techniques for generating functional and security assertions from design source code. While there have been recent works that use commercial off-the-shelf (COTS) LLMs for assertion generation, there is no comprehensive study quantifying the effectiveness of LLMs in generating syntactically and semantically correct assertions. In this paper, we first discuss AssertionBench from our prior work, a comprehensive set of designs and assertions to quantify the goodness of a broad spectrum of COTS LLMs for the task of assertion generation from hardware design source code. Our key insight was that COTS LLMs are not yet ready for prime-time adoption for assertion generation as they generate a considerable fraction of syntactically and semantically incorrect assertions. Motivated by this insight, we propose AssertionLLM, a first-of-its-kind LLM, specifically fine-tuned for assertion generation. Our initial experimental results show that AssertionLLM considerably improves the semantic and syntactic correctness of the generated assertions over COTS LLMs.
11:22 CEST FS01.2 SECURITY ASSERTIONS FOR TRUSTED EXECUTION ENVIRONMENTS
Speaker:
Prabhat Mishra, University of Florida, US
Authors:
Hasini Witharana, Hansika Weerasena and Prabhat Mishra, University of Florida, US
Abstract
Trusted Execution Environment (TEE) provides a secure and isolated execution environment for sensitive applications. In order to design secure and trustworthy TEE-based systems, it is crucial to verify the trustworthiness of TEE implementations. Property checking is a promising avenue to guarantee that the TEE implementation satisfies the security properties. In the presence of a vulnerability, property checking will fail and provide a counterexample that can be utilized to fix the vulnerability. A major challenge in TEE property checking is that it relies on manual definition of the security properties, which can be cumbersome and error-prone. In this paper, we propose an efficient framework for automated generation and verification of TEE specific properties. Specifically, we leverage Finite State Machine (FSM) analysis to automatically derive and validate security properties utilizing templates. The effectiveness of the proposed method is demonstrated through experimental evaluation of Intel Trust Domain Extension (TDX), highlighting its potential for verifying security and trustworthiness of modern trusted execution environments.
11:45 CEST FS01.3 A BASELINE FRAMEWORK FOR THE QUALIFICATION OF SPECIFICATIONS MINERS
Speaker:
Samuele Germiniani, University of Guglielmo Marconi and University of Verona, IT
Authors:
Samuele Germiniani, Daniele Nicoletti and Graziano Pravadelli, Università di Verona, IT
Abstract
Over the past few decades, the verification community has developed several specification miners as an alternative to manual assertion definition. However, assessing their effectiveness remains a challenging task. Most studies evaluate these miners using predefined ranking metrics, which often fail to ensure the quality of the inferred specifications, especially when no fixed ground truth exists and the relevance of the specifications varies depending on the use case. This paper presents a comprehensive framework aimed at facilitating the evaluation and comparison of LTL specification miners. Unlike traditional approaches, which struggle with subjective analyses and complex tool configurations, our framework provides a structured method for assessing and comparing the quality of specifications generated by multiple sources, using both semantic and syntactic techniques. To achieve this, the framework offers users an easy-to-extend environment for installing, configuring, and running third-party miners via Docker containers. Additionally, it supports the inclusion of new evaluation methods through a modular design. Miner comparison can be based either on user-defined designs or on synthetic benchmarks, which are automatically generated to serve as a non-subjective ground truth for the evaluation of the miners. We demonstrate the utility of our framework through comparative analyses with four well-known LTL miners, illustrating its ability to standardize and enhance the specification mining evaluation process.
12:07 CEST FS01.4 SPECIFICATION MINING FACING GENERATIVE AI
Speaker:
Goerschwin Fey, TU Hamburg, DE
Authors:
Goerschwin Fey1, Harry Foster2, Tara Ghasempouri3, Badri Gopalan4, Joerg Mueller5 and Manish Pandey4
1TU Hamburg, DE; 2Siemens/Mentor Graphics, US; 3Department of Computer Systems, Tallinn University of Technology, EE; 4Synopsys, US; 5Formal Verification Expert, DE
Abstract
Specifications for complex designs and their consistency are always a headache. Automated specification mining – including but not limited to generative AI – offers attractive solutions, but there are also various unmet needs.

LKS01 Later … with the keynote speakers

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:00 CEST


TS01 Emerging design technologies for future computing

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Time | Label | Presentation Title | Authors
11:00 CEST TS01.1 OPTIMAL SYNTHESIS OF MEMRISTIVE MIXED-MODE CIRCUITS
Speaker:
Ilia Polian, University of Stuttgart, DE
Authors:
Ilia Polian1, Xianyue Zhao2, Li-Wei Chen1, Felix Bayhurst1, Ziang Chen2, Heidemarie Schmidt2 and Nan Du2
1University of Stuttgart, DE; 2University of Jena and Leibniz Institute of Photonic Technology, Jena, Germany, DE
Abstract
Memristive crossbars are attractive for in-memory computing due to their integration density combined with compute and storage capabilities of their basic devices. However, yield and fidelity of emerging memristive technologies can make their reliable operation unattainable, thus raising interest in simpler topologies. In this paper, we consider synthesis of Boolean functions on 1D memristive line arrays. We propose an optimal procedure that can fully utilize the rich electrical behavior of memristive devices, mixing stateful (resistance-input) and nonstateful (voltage-input) operations as desired by the designer, leveraging their respective strengths. The synthesis method is based on Boolean satisfiability (SAT) solving and supports flexible constraints to enforce, e.g., restrictions of the available peripherals. We experimentally validate memristive logic circuits beyond individual logic gates by demonstrating the operation of a Galois field multiplier using a 1D line array of 10 memristors in parallel, highlighting the robust performance of our proposed mixed-mode circuit and its synthesis procedure.
11:05 CEST TS01.2 NVCIM-PT: AN NVCIM-ASSISTED PROMPT TUNING FRAMEWORK FOR EDGE LLMS
Speaker:
Ruiyang Qin, University of Notre Dame, US
Authors:
Ruiyang Qin1, Pengyu Ren1, Zheyu Yan2, Liu Liu1, Dancheng Liu3, Amir Nassereldine3, Jinjun Xiong3, Kai Ni1, X. Sharon Hu1 and Yiyu Shi1
1University of Notre Dame, US; 2Zhejiang University, CN; 3University at Buffalo, US
Abstract
Large Language Models (LLMs) deployed on edge devices, known as edge LLMs, only use constrained resources to learn from user-generated data. Although existing learning methods have demonstrated performance improvements for edge LLMs, their constraints in high resource cost and low learning capacity limit their effectiveness as optimal learning methods for edge LLMs. Prompt tuning (PT), a learning method without these constraints, has significant potential to improve edge LLM performance while modifying only a small portion of LLM parameters. However, PT-based edge LLMs can suffer from user domain shift, leading to repetitive training that neither effectively improves performance nor resource efficiency. Conventional efforts to address domain shifts involve more complex neural network designs and sophisticated training, inevitably resulting in higher resource usage. It remains an open question: how can we avoid domain shift and high resource usage for edge LLM PT? In this paper, we propose a prompt tuning framework for edge LLMs, exploiting the benefits offered by non-volatile computing-in-memory (NVCiM) architectures. We introduce a novel NVCiM-assisted PT framework, where we narrow down the core operations to matrix-matrix multiplication, accelerated by performing in-situ computation on NVCiM. To the best of our knowledge, this is the first work employing NVCiM to improve the edge LLM PT performance.
11:10 CEST TS01.3 PICELF: AN AUTOMATIC ELECTRONIC LAYER LAYOUT GENERATION FRAMEWORK FOR PHOTONIC INTEGRATED CIRCUITS
Speaker:
Xiaohan Jiang, The Hong Kong University of Science and Technology, HK
Authors:
Xiaohan Jiang1, Yinyi Liu1, Peiyu Chen2, Wei Zhang1 and Jiang Xu2
1The Hong Kong University of Science and Technology, HK; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
In recent years, the advent of photonic integrated circuits (PICs) has demonstrated great prospects and applications to address critical issues such as limited bandwidth, high latency, and high power consumption in data-intensive systems. However, the field of physical design automation for PICs remains in its infancy, with a notable gap in electronic layer layout design tools. Current research on PIC physical design automation primarily focuses on optical layer layouts, often overlooking the equally crucial electronic layer layouts. Although well-established for conventional integrated circuits (ICs), existing EDA tools are inadequately adapted for PICs due to their unique characteristics and constraints. As PICs grow in integration density and size, traditional manual-based design methods become increasingly inefficient and sub-optimal, potentially compromising overall PIC performance. To address this challenge, we propose PICELF, the first framework in the literature for automatic PIC electronic layer layout generation. Our framework comprises a nonlinear binary programming (NBP)-based netlist generator with scalability optimization and a two-stage router featuring initial parallel routing followed by post-routing optimization. We validate our framework's effectiveness and efficiency using a real PIC chip benchmark established by us. Experimental results demonstrate that our method can efficiently generate high-quality PIC electronic layer layouts and satisfy all design rules, within reasonable CPU times, while related existing methods are not applicable.
11:15 CEST TS01.4 SYSTEM LEVEL PERFORMANCE EVALUATION FOR SUPERCONDUCTING SYSTEMS
Speaker:
Debjyoti Bhattacharjee, IMEC, BE
Authors:
Joyjit Kundu, Debjyoti Bhattacharjee, Nathan Josephsen, Ankit Pokhrel, Udara De Silva, Wenzhe Guo, Steven Winckel, Steven Brebels, Manu Perumkunnil, Quentin Herr and Anna Herr, imec, BE
Abstract
Superconducting Digital (SCD) technology offers significant potential for enhancing the performance of next generation large scale compute workloads. By leveraging advanced lithography and a 300 mm platform, SCD devices can reduce energy consumption and boost computational power. This paper presents an analytical performance modeling approach to evaluate the system-level performance benefits of SCD architectures for LLM training and inference. Our findings, based on experimental data and Pulse Conserving Logic (PCL) design principles, demonstrate substantial improvements in both training and inference. SCD's ability to address memory and interconnect limitations positions it as a promising solution for next-generation compute systems.
11:20 CEST TS01.5 INTEGRATED HARDWARE ANNEALING BASED ON LANGEVIN DYNAMICS FOR ISING MACHINES
Speaker:
Hui Wu, University of Rochester, US
Authors:
Yongchao Liu, Lianlong Sun, Michael Huang and Hui Wu, University of Rochester, US
Abstract
Ising machines are non-von Neumann machines designed to solve combinatorial optimization problems (COP) by searching for the ground state, or the lowest energy configuration, within the Ising model. However, Ising machines often face the challenges of getting trapped in local minima due to the complex energy landscapes. Hardware annealing algorithms help mitigate this issue by using a probabilistic approach to steer the system toward the ground state. In this paper, we present a hardware annealing algorithm for Ising machines based on Langevin dynamics, a stochastic perturbation by random noise. Theoretical analysis, system-level design, and detailed circuit design are carried out. We evaluate the performance of the algorithm through chip-level simulation using a standard 65-nm CMOS technology to demonstrate the algorithm's efficacy. The results show that the proposed hardware annealing algorithm effectively guides the system to reach the ground state with a probability of 86.5%, significantly improving the solution quality by 97.5%. Further, we compare the algorithm with state-of-the-art hardware annealing methods through behavioral-level simulations, highlighting its improved solution quality alongside a 50% reduction in time-to-solution.
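As background for the technique named above, overdamped Langevin dynamics in its generic textbook form augments gradient descent on the Ising energy with a white-noise term; the equation below is that generic formulation, not the authors' specific circuit realization:

  ds/dt = -∇E(s) + sqrt(2T) · ξ(t),   with   E(s) = -Σ_{i<j} J_ij s_i s_j - Σ_i h_i s_i

where T is an effective temperature controlling the noise strength and ξ(t) is zero-mean white noise; the noise lets the state escape local minima of E while, on average, descending toward the ground state.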
11:25 CEST TS01.6 NORA: NOISE-OPTIMIZED RESCALING OF LLMS ON ANALOG COMPUTE-IN-MEMORY ACCELERATORS
Speaker:
Garrett Gagnon, Rensselaer Polytechnic Institute, US
Authors:
Yayue Hou1, Hsinyu Tsai2, Kaoutar El Maghraoui2, Tayfun Gokmen2, Geoffrey Burr2 and Liu Liu1
1Rensselaer Polytechnic Institute, US; 2IBM, US
Abstract
Large Language Models (LLMs) have become critical in AI applications, yet current digital AI accelerators suffer from significant energy inefficiencies due to frequent data movement. Analog compute-in-memory (CIM) accelerators offer a potential solution for improving energy efficiency but introduce non-idealities that can degrade LLM accuracy. While analog CIM has been extensively studied for traditional deep neural networks, its impact on LLMs remains unexplored, particularly concerning the large influence of Analog CIM non-idealities. In this paper, we conduct a sensitivity analysis on the effects of analog-induced noise on LLM accuracy. We find that while LLMs demonstrate robustness to weight-related noise, they are highly sensitive to quantization noise and additive Gaussian noise. Based on these insights, we propose a noise-optimized rescaling method to mitigate LLM accuracy loss by shifting the non-ideality burden from the sensitive input/output to the more resilient weight. Through rescaling, we can implement the OPT-6.7b model on simulated analog CIM hardware with less than 1% accuracy loss from the floating-point baseline, compared to a much higher loss of around 30% without rescaling.
11:30 CEST TS01.7 BOSON-1: UNDERSTANDING AND ENABLING PHYSICALLY-ROBUST PHOTONIC INVERSE DESIGN WITH ADAPTIVE VARIATION-AWARE SUBSPACE OPTIMIZATION
Speaker:
Haoyu Yang, Nvidia Inc., US
Authors:
Pingchuan Ma1, Zhengqi Gao2, Amir Begovic3, Meng Zhang3, Haoyu Yang4, Haoxing Ren4, Rena Huang3, Duane Boning2 and Jiaqi Gu1
1Arizona State University, US; 2Massachusetts Institute of Technology, US; 3Rensselaer Polytechnic Institute, US; 4NVIDIA Corp., US
Abstract
Nanophotonic device design aims to optimize photonic structures to meet specific requirements across various applications. Inverse design has unlocked non-intuitive, high-dimensional design spaces, enabling the discovery of compact, high-performance device topologies beyond traditional heuristic or analytic methods. The adjoint method, which calculates analytical gradients for all design variables using just two electromagnetic simulations, enables efficient navigation of this complex space. However, many inverse-designed structures, while numerically plausible, are difficult to fabricate and highly sensitive to physical variations, limiting their practical use. The discrete material distributions with numerous local-optimal structures also pose significant optimization challenges, often causing gradient-based methods to converge on suboptimal designs. In this work, we formulate inverse design as a fabrication-restricted, discrete, probabilistic optimization problem and introduce BOSON-1, an end-to-end, adaptive, variation-aware subspace optimization framework to address the challenges of manufacturability, robustness, and optimizability. With elegant reparametrization, we explicitly emulate the fabrication process and differentiably optimize the design in the fabricable subspace. To overcome optimization difficulty, we propose dense target-enhanced gradient flows to mitigate misleading local optima and introduce a conditional subspace optimization strategy to create high-dimensional tunnels to escape local optima. Furthermore, we significantly reduce the prohibitive runtime associated with optimizing across exponential variation samples through an adaptive sampling-based robust optimization method, ensuring both efficiency and variation robustness. On three representative photonic device benchmarks, our proposed inverse design methodology BOSON-1 delivers fabricable structures and achieves the best convergence and performance under realistic variations, outperforming prior arts with 74.3% post-fabrication performance.
11:35 CEST TS01.8 BIMAX: A BITWISE IN-MEMORY ACCELERATOR USING 6T-SRAM STRUCTURE
Speaker:
Nezam Rohbani, BSC, ES
Authors:
Nezam Rohbani1, Mohammad Arman Soleimani2, Behzad Salami3, Osman Unsal3, Adrian Cristal Kestelman3 and Hamid Sarbazi-Azad4
1Institute for Research in Fundamental Sciences (IPM), IR; 2Sharif University of Technology, IR; 3BSC, ES; 4Sharif University of Technology, IR
Abstract
The in-memory computing (IMC) paradigm reduces costly and inefficient data transfer between memory modules and processing cores by implementing simple and parallel operations inside the memory subsystem. SRAM, the fastest memory structure in the memory hierarchy, is an appropriate platform to implement IMC. However, the main challenges of implementing IMC in SRAM are the limited operations and unreliable accuracy due to environmental noise and process variations. This work proposes a low-latency, energy-efficient, and noise-robust IMC technique, called Bitwise In-Memory Accelerator using 6T-SRAM Structure (BIMAX). BIMAX performs parallel bitwise operations (i.e., (N)AND, (N)OR, NOT, X(N)OR) as well as row-copy with the capability of writing the computation result back to a target memory row. BIMAX functionality is based on an imbalanced differential sense amplifier (SA) that reads and writes data from and into multiple 6T-SRAM cells. The simulations show that BIMAX performs these operations with 52.7% lower energy dissipation and a 5.7% higher average performance rate compared to the state-of-the-art IMC technique. Furthermore, BIMAX is about 5.4× more robust against environmental noise compared to the state-of-the-art.
11:40 CEST TS01.9 DSC-ROM: A FULLY DIGITAL SPARSITY-COMPRESSED COMPUTE-IN-ROM ARCHITECTURE FOR ON-CHIP DEPLOYMENT OF LARGE-SCALE DNNS
Speaker:
Tianyi Yu, Tsinghua University, CN
Authors:
Tianyi Yu, Zhonghao Chen, Yiming Chen, Shuang Wang, Yongpan Liu, Huazhong Yang and Xueqing Li, Tsinghua University, CN
Abstract
Compute-in-Memory (CiM) is a promising technique to mitigate the memory bottleneck for energy-efficient deep neural network (DNN) inference. Unfortunately, conventional SRAM-based CiM has low density and limited on-chip capacity, resulting in undesired weight reloading from off-chip DRAM. The emerging high-density ROM-based CiM architecture has recently revealed the opportunity of deploying large-scale DNNs on-chip, with optional assisting SRAM to ensure moderate flexibility. However, prior analog-domain ROM CiM still suffers from limited memory density improvement and low computing area efficiency due to stringent array structure and large A/D converter (ADC) overhead. This paper presents DSC-ROM, a fully digital sparsity-compressed compute-in-ROM architecture to address these challenges. DSC-ROM introduces a fully synthesizable macro-level design methodology that achieves a record-high memory density of 27.9 Mb/mm^2 in a 28nm CMOS technology. Experimental results show that the macro area efficiency of DSC-ROM improves by 5.6-6.6x compared with prior analog-based ROM CiM. Furthermore, a novel weight fine-tuning technique is proposed to ensure task transfer flexibility and reduce required assisting SRAM cells by 94.4%. Experimental results show that DSC-ROM designed for ResNet-18 pre-trained on ImageNet dataset achieves <0.5% accuracy loss in CIFAR-10 and FER2013, compared with the fully SRAM-based CiM.
11:45 CEST TS01.10 COMPACT NON-VOLATILE LOOKUP TABLE ARCHITECTURE BASED ON FERROELECTRIC FET ARRAY THROUGH IN-SITU COMBINATORIAL ONE-HOT ENCODING FOR RECONFIGURABLE COMPUTING
Speaker:
Weikai Xu, Peking University, CN
Authors:
Weikai Xu, Meng Li, Qianqian Huang and Ru Huang, Peking University, CN
Abstract
Lookup tables (LUTs) are widely used for reconfigurable computing applications due to the capability of implementing arbitrary logic functions. Various emerging non-volatile memories (eNVMs) have been introduced for LUT designs with reduced hardware cost and power consumption compared with conventional SRAM-based LUT. However, the existing designs still follow the conventional LUT architecture, where the memory cells are only used for storage of configuration bits, requiring dedicated bulky multiplexer (MUX) for computation of each LUT, resulting in inevitable high area, latency, and energy cost. In this work, a compact and efficient non-volatile LUT architecture based on ferroelectric FET (FeFET) array is proposed, where the configuration bit storage and computation can be implemented within the FeFET array through in-situ combinatorial one-hot encoding, eliminating the need of costly MUX for each LUT. Moreover, multibit LUTs can be efficiently implemented in the FeFET array using only one shared decoder instead of multiple costly MUXs. Due to the eliminated MUX in the calculation path, the proposed LUT can also achieve enhanced computation speed compared with the conventional LUTs. Based on the proposed LUT architecture, the input expansion of LUT, full adder, and content addressable memory are further implemented and demonstrated with reduced hardware and energy cost. Evaluation results show that the proposed FeFET array-based LUT architecture achieves 51.7×/8.3× reduction in area-energy-delay product compared with conventional SRAM-based/FeFET-based LUT architecture, indicating its great potential for reconfigurable computing applications.
11:50 CEST TS01.11 GRAMC: GENERAL-PURPOSE AND RECONFIGURABLE ANALOG MATRIX COMPUTING ARCHITECTURE
Speaker:
Lunshuai Pan, Peking University, CN
Authors:
Lunshuai Pan, Shiqing Wang, Pushen Zuo and Zhong Sun, Peking University, CN
Abstract
In-memory analog matrix computing (AMC) with resistive random-access memory (RRAM) represents a highly promising solution that solves matrix problems in one step. However, the existing AMC circuits each have a specific connection topology to implement a single computing function, and thus lack the universality of a matrix processor. In this work, we design a reconfigurable AMC macro for general-purpose matrix computations, which is achieved by configuring proper connections between the memory array and amplifier circuits. Based on this macro, we develop a hybrid system that incorporates an on-chip write-verify scheme and digital functional modules, to deliver a general-purpose AMC solver for various applications.
11:51 CEST TS01.12 SHWCIM: A SCALABLE HETEROGENEOUS WORKLOAD COMPUTING-IN-MEMORY ARCHITECTURE
Speaker:
Yanfeng Yang, School of Microelectronics, South China University of Technology, CN
Authors:
Yanfeng Yang1, Yi Zou2, Zhibiao Xue2 and Liuyang Zhang3
1School of Integrated Circuits, South China University of Technology, CN; 2School of Microelectronics, South China University of Technology, CN; 3School of Microelectronics, Southern University of Science and Technology, CN
Abstract
This study introduces HWCIM, an SRAM-based Computing-In-Memory core, and SHWCIM, a CIM-capable Coarse-Grained Reconfigurable Architecture, to enhance resource utilization, multi-functionality, and on-chip memory size in SRAM-based CIM designs. Evaluated using the SMIC 55nm process, HWCIM achieves 1.6× lower power, 2.8× higher energy efficiency, and up to 4.1× smaller area compared to previous CIM and CGRA works. Additionally, SHWCIM delivers an average 105.9× speedup over existing CGRAs and consumes 2–5× less energy than the Nvidia A40 GPU on realistic workloads.

TS02 Secure systems, circuits, and architectures

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Time | Label | Presentation Title | Authors
11:00 CEST TS02.1 FLEXENM: A FLEXIBLE ENCRYPTING-NEAR-MEMORY WITH REFRESH-LESS EDRAM-BASED MULTI-MODE AES
Speaker:
Hyunseob Shin, Korea University, KR
Authors:
Hyunseob Shin and Jaeha Kung, Korea University, KR
Abstract
On-chip cryptography engines face significant challenges in efficiently processing large volumes of data while maintaining security and versatility. Most existing solutions support only a single AES mode, limiting their applicability across diverse use cases. This paper introduces FlexENM, a low-power and area-efficient near-eDRAM encryption engine. The FlexENM implements refresh-less operation by leveraging inherent characteristics of the AES algorithm, reordering AES stages, and employing a simultaneous read and write scheme using dual-port eDRAM. Furthermore, FlexENM supports three AES modes, parallelizing their operations and sharing hardware resources across different modes to improve compute efficiency. Compared to other AES engines, FlexENM achieves 16% lower power consumption and 83% higher throughput per unit area, on average, demonstrating improved power- and area-efficiency for on-chip data protection.
11:05 CEST TS02.2 PASTA ON EDGE: CRYPTOPROCESSOR FOR HYBRID HOMOMORPHIC ENCRYPTION
Speaker:
Aikata Aikata, TU Graz, AT
Authors:
Aikata Aikata1, Daniel Sobrino2 and Sujoy Sinha Roy1
1TU Graz, AT; 2Universidad Politécnica de Madrid, ES
Abstract
Fully Homomorphic Encryption (FHE) enables privacy-preserving computation but imposes significant computational and communication overhead on the client for the public-key encryption. To alleviate this burden, previous works have introduced the Hybrid Homomorphic Encryption (HHE) paradigm, which combines symmetric encryption with homomorphic decryption to enhance performance for the FHE client. While early HHE schemes focused on binary data, modern versions now support integer prime fields, improving their efficiency for practical applications such as secure machine learning. Despite several HHE schemes proposed in the literature, there has been no comprehensive study evaluating their performance or area advantages over FHE for encryption tasks. This paper addresses this gap by presenting the first implementation of an HHE scheme, PASTA, a symmetric encryption scheme over integers designed to facilitate fast client encryption and homomorphic symmetric decryption on the server. We provide its performance results for both FPGA and ASIC platforms, including a RISC-V System-on-Chip (SoC) implementation on a low-end 130nm ASIC technology, which achieves a 43–171x speedup compared to a CPU. Additionally, on high-end 7nm and 28nm ASIC platforms, our design demonstrates a 97x speedup over prior public-key client accelerators for FHE. We have made our design public and benchmarked an application to support future research.
11:10 CEST TS02.3 DESIGN, IMPLEMENTATION AND VALIDATION OF NSCP: A NEW SECURE CHANNEL PROTOCOL FOR HARDENED IOT
Speaker:
Vittorio Zaccaria, Politecnico di Milano, IT
Authors:
Joan Bushi1, Alberto Battistello2, Guido Bertoni2 and Vittorio Zaccaria1
1Politecnico di Milano, IT; 2Security Pattern, IT
Abstract
This paper deals with the design, implementation, and validation of a new secure channel protocol to connect microcontrollers and secure elements. The new secure channel protocol (NSCP) relies on a lightweight cryptographic primitive (Xoodyak) and simplified operating principles to provide secure data exchange. The performance of the new protocol is compared with that of GlobalPlatform's Secure Channel Protocol 03 (SCP03), the current de facto standard for hardening the connection between a microcontroller and a secure element in industrial IoT. The evaluation was performed in two scenarios where the secure element was emulated with an Arm Cortex-M4 and an OpenHW RISC-V MPU synthesized on an Artix FPGA. The results of the evaluation indicate the potential advantage of the new protocol over SCP03: in the best case, the new protocol applies cryptographic protection to messages 3.64x to 4x faster than SCP03 at its maximum security level. The speedup in the channel initiation process is also considerable, with a factor of up to 3.7. These findings demonstrate that it is possible to conceive a new protocol which offers adequate cryptographic protection while being more lightweight than the present standard.
11:15 CEST TS02.4 RHYCHEE-FL: ROBUST AND EFFICIENT HYPERDIMENSIONAL FEDERATED LEARNING WITH HOMOMORPHIC ENCRYPTION
Speaker:
Yujin Nam, University of California, San Diego, US
Authors:
Yujin Nam1, Abhishek Moitra2, Yeshwanth Venkatesha2, Xiaofan Yu1, Gabrielle De Micheli1, Xuan Wang1, Minxuan Zhou3, Augusto Vega4, Priyadarshini Panda2 and Tajana Rosing1
1University of California, San Diego, US; 2Yale University, US; 3Illinois Tech, US; 4IBM Research, US
Abstract
Federated learning (FL) is a widely-used collaborative learning approach where clients train models locally without sharing their data with servers. However, privacy concerns remain since clients still upload locally trained models, which could reveal sensitive information. Fully homomorphic encryption (FHE) addresses this issue by enabling clients to share encrypted models and the server to aggregate them without decryption. While FHE resolves the privacy concerns, the encrypted data introduces larger communication and computational complexity. Moreover, ciphertexts are vulnerable to channel noise, where a single bit error can disrupt model convergence. To overcome these limitations, we introduce Rhychee-FL, the first lightweight and noise-resilient FHE-enabled FL framework based on Hyperdimensional Computing (HDC), a low-overhead training method. Rhychee-FL leverages HDC's small model size and noise resilience to reduce communication overhead and enhance model robustness without sacrificing accuracy or privacy. Additionally, we thoroughly investigate the parameter space of Rhychee-FL and propose an optimized system in terms of computation and communication costs. Finally, we show that our global model can successfully converge without being impacted by channel noise. Rhychee-FL achieves comparable final accuracy to CNN, while reaching 90% accuracy in 6x fewer rounds and with 2.2x greater communication efficiency. Our framework shows at least 4.5x faster client side latency compared to previous FHE-based FL works.
11:20 CEST TS02.5 COMPROMISING THE INTELLIGENCE OF MODERN DNNS: ON THE EFFECTIVENESS OF TARGETED ROW PRESS
Speaker:
Shaahin Angizi, New Jersey Institute of Technology, US
Authors:
Ranyang Zhou1, Jacqueline Liu2, Sabbir Ahmed3, Shaahin Angizi1 and Adnan Siraj Rakin2
1New Jersey Institute of Technology, US; 2Binghamton University, US; 3Binghamton University (SUNY), US
Abstract
Recent advancements in side-channel attacks have revealed the vulnerability of modern Deep Neural Networks (DNNs) to malicious adversarial weight attacks. The well-studied RowHammer attack has effectively compromised DNN performance by inducing precise and deterministic bit-flips in the main memory (e.g., DRAM). Similarly, RowPress has emerged as another effective strategy for flipping targeted bits in DRAM. However, the impact of RowPress on deep learning applications has yet to be explored in the existing literature, leaving a fundamental research question unanswered: How does RowPress compare to RowHammer in leveraging bit-flip attacks to compromise DNN performance? This paper is the first to address this question and evaluate the impact of RowPress on DNN applications. We conduct a comparative analysis utilizing a novel DRAM-profile-aware attack designed to capture the distinct bit-flip patterns caused by RowHammer and RowPress. Eleven widely-used DNN architectures trained on different benchmark datasets deployed on a Samsung DRAM chip conclusively demonstrate that they suffer from a drastically more rapid performance degradation under the RowPress attack compared to RowHammer. The difference in the underlying attack mechanism of RowHammer and RowPress also renders existing RowHammer mitigation mechanisms ineffective under RowPress. As a result, RowPress introduces a new vulnerability paradigm for DNN compute platforms and unveils the urgent need for corresponding protective measures.
11:25 CEST TS02.6 COALA: COALESCION-BASED ACCELERATION OF POLYNOMIAL MULTIPLICATION FOR GPU EXECUTION
Speaker:
Homer Gamil, New York University, US
Authors:
Homer Gamil, Oleg Mazonka and Michail Maniatakos, New York University Abu Dhabi, AE
Abstract
In this study, we introduce Coala, a novel framework designed to enhance the performance of finite field transformations for GPU environments. We have developed a GPU-optimized version of the Discrete Galois Transformation (DGT), a variant of the Number Theoretic Transform (NTT). We introduce a novel data access pattern scheme specifically engineered to enable coalesced accesses, significantly enhancing the efficiency of data transfers between global and shared memory. This enhancement not only boosts execution efficiency but also optimizes the interaction with the GPU's memory architecture. Additionally, Coala presents a comprehensive framework that optimizes the allocation of computational tasks across the GPU's architecture and execution kernels, thereby maximizing the use of GPU resources. Lastly, we provide a flexible method to adjust security levels and polynomial sizes through the incorporation of an in-kernel RNS method, and a flexible parameter generation approach. Comparative analysis against current state-of-the-art techniques reveals significant improvements. We observe performance gains of 2.82x - 17.18x against other DGT works on GPUs for different parameters, achieved concurrently with equal or lesser memory utilization.
11:30 CEST TS02.7 HEILP: AN ILP-BASED SCALE MANAGEMENT METHOD FOR HOMOMORPHIC ENCRYPTION COMPILER
Speaker:
Weidong Yang, Shanghai Jiao Tong University, CN
Authors:
Weidong Yang, Shuya Ji, Jianfei Jiang, Naifeng Jing, Qin Wang, Zhigang Mao and Weiguang Sheng, Shanghai Jiao Tong University, CN
Abstract
RNS-CKKS, a fully homomorphic encryption (FHE) scheme that enables secure computation on encrypted data, has been widely used in statistical analysis and data mining. However, developing RNS-CKKS programs requires substantial knowledge of cryptography, which is unfriendly to non-expert programmers. A critical obstacle is scale management, which affects both the complexity of programming and performance. Different FHE operations impose specific requirements on the scale and level, necessitating programmer intervention to ensure the recoverability of the results. Furthermore, operations at different levels have a significant impact on program performance. Existing methods rely on heuristic insights or iterative methods to manage the scales of ciphertexts. However, these methods lack a holistic understanding of the optimization space, leading to inefficient exploration and suboptimal performance. This work proposes HEILP, the first constrained-optimization-based approach for scale management in FHE. HEILP expresses node scale decisions and the insertion of scale management operations as an integer linear programming model, which can be solved with existing mathematical techniques in one shot. Our method creates a more comprehensive optimization space and enables faster and more efficient exploration. Experimental results demonstrate that HEILP achieves an average performance improvement of 1.72x over the existing heuristic method, and delivers a 1.19x performance improvement with 48.65x faster compilation time compared to the state-of-the-art iteration-based method.
11:35 CEST TS02.8 A UNIFIED VECTOR PROCESSING UNIT FOR FULLY HOMOMORPHIC ENCRYPTION
Speaker:
Jiangbin Dong, Xi'an Jiaotong University, CN
Authors:
Jiangbin Dong1, Xinhua Chen2 and Mingyu Gao3
1Xi'an Jiaotong University, CN; 2Fudan University, CN; 3Tsinghua University, CN
Abstract
Fully homomorphic encryption (FHE) algorithms enable privacy-preserving computing directly on encrypted data without leaking sensitive contents, while their excessive computational overheads could be alleviated by specialized hardware accelerators. The vector architecture has been prominently used for FHE accelerators to match the underlying polynomial data structures. While most FHE operations can be efficiently supported by vector processing units, the number theoretic transform (NTT) and automorphism operators involve complex and irregular data permutations among vector elements, and thus are handled with separate dedicated hardware units in existing FHE accelerators. In this paper, we present an efficient inter-lane network design and the corresponding dataflow control scheme, in order to realize NTT and automorphism operations among the multiple lanes of a vector unit. An arbitrarily large operator is first decomposed to fit in the fixed width of the vector unit, and the required data permutation and transposition are conducted on the specialized inter-lane network. Compared to previous designs, our solution reduces the hardware resources needed, with up to 9.4x area and 6.0x power savings for only the inter-lane network, and up to 1.2x area and 1.1x power savings for the whole vector unit.
11:40 CEST TS02.9 TESTING ROBUSTNESS OF HOMOMORPHICALLY ENCRYPTED SPLIT MODEL LLMS
Speaker:
Lars Folkerts, University of Delaware, US
Authors:
Lars Folkerts and Nektarios Georgios Tsoutsos, University of Delaware, US
Abstract
Large language models (LLMs) have recently transformed many industries, enhancing content generation, customer service agents, data analysis, and even software generation. These applications are often hosted on remote servers to protect the neural-network model IP; however, this raises concerns about the privacy of input queries. Fully Homomorphic Encryption (FHE), an encryption technique that allows computations on private data, has been proposed as a solution to this challenge. Nevertheless, due to the increased size of LLMs and the computational overheads of FHE, today's practical FHE LLMs are implemented using a split model approach. Here, a user sends their FHE encrypted data to the server to run an encrypted attention head layer; then the server returns the result of the layer for the user to run the rest of the model locally. By employing this method, the server maintains part of their model IP, while the user still gets to perform private LLM inference. In this work, we evaluate the neural-network model IP protections of single-layer split model LLMs, and demonstrate a novel attack vector that makes it easy for a user to extract the neural network model IP from the server, bypassing the claimed protections for encrypted computation. In our analysis, we demonstrate the feasibility of this attack, and discuss potential mitigations.
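The core leakage can be illustrated in the clear, ignoring encryption entirely (a hypothetical toy, not the paper's setup): if the server's split layer applies a hidden weight matrix to whatever the client submits and returns the result, the client can recover that matrix column by column with standard basis queries.

```python
import numpy as np

rng = np.random.default_rng(0)
W_secret = rng.standard_normal((4, 3))        # server-side weights, hidden from the client

def server_layer(x: np.ndarray) -> np.ndarray:
    """Server applies its proprietary linear layer to the client's (notionally encrypted) input."""
    return W_secret @ x

d = 3
recovered = np.column_stack([server_layer(np.eye(d)[:, j]) for j in range(d)])
print(np.allclose(recovered, W_secret))       # True: the weight matrix is fully recovered
```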
11:45 CEST TS02.10 TARN: TRUST AWARE ROUTING TO ENHANCE SECURITY IN 3D NETWORK-ON-CHIPS
Speaker:
Naghmeh Karimi, University of Maryland Baltimore County, US
Authors:
Hasin Ishraq Reefat1, Alec Aversa2, Ioannis Savidis2 and Naghmeh Karimi1
1University of Maryland Baltimore County, US; 2Drexel University, US
Abstract
The growing complexity and performance demands of modern computing systems have resulted in a shift from traditional System-on-Chip (SoC) designs to Network-on-Chip (NoC) architectures, and further to three-dimensional Network-on-Chip (3D NoC) solutions. Despite their performance and power efficiency, the increased complexity and inter-layer communication of 3D NoCs can create opportunities for adversaries who aim to prevent reliable communication between embedded nodes by inserting hardware Trojans in such nodes. The hardware Trojans, introduced through untrusted third-party Intellectual Property (IP) blocks, can severely compromise 3D NoCs by tampering with data integrity, misrouting packets, or dropping them, thus triggering denial-of-service attacks. Detecting such behaviors is particularly difficult due to their infrequent activation. It is therefore of utmost importance to take the trustworthiness of the embedded nodes into account when routing packets in the NoC. Accordingly, this paper proposes a trust-aware routing scheme, called TARN, to significantly reduce the rate of packet loss that can occur due to malicious behaviors of one or more nodes (or interconnects). Our distributed trust-aware path selection protocol bypasses malicious IPs and securely routes packets to their destination. Furthermore, we introduce a low-overhead mechanism for delegating trust scores to neighboring routers, thereby enhancing network efficiency. Experimental results demonstrate significant reductions in packet loss while imposing low performance and energy overhead.
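The paper's distributed protocol is not reproduced here, but the sketch below conveys the flavor of trust-aware path selection on a toy NoC graph: the cost of a hop is inflated by the distrust of the downstream router, so a shortest-path search naturally routes around nodes suspected of dropping or corrupting packets.

```python
import heapq

def trust_aware_route(adj, trust, src, dst):
    """Dijkstra over trust-weighted hops: low-trust nodes make their incoming links expensive."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v in adj[u]:
            nd = d + 1.0 / trust[v]          # a hop into a distrusted node costs more
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]

adj = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
trust = {"A": 1.0, "B": 0.2, "C": 0.9, "D": 1.0}   # B is suspected of dropping packets
print(trust_aware_route(adj, trust, "A", "D"))      # ['A', 'C', 'D']
```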
11:50 CEST TS02.11 C2C: A FRAMEWORK FOR CRITICAL TOKEN CLASSIFICATION IN TRANSFORMER-BASED INFERENCE SYSTEMS
Speaker:
Sihyun Kim, KAIST, KR
Authors:
Myeongjae Jang, Jesung Kim, Haejin Nam, Sihyun Kim and Soontae Kim, KAIST, KR
Abstract
Because embedding vectors in a Transformer-based model represent crucial information about input texts, attacks or errors affecting them can cause severe accuracy degradation. For the first time, we observe critical tokens that determine the overall accuracy even though their embedding vectors take up only a small portion of the embedding table. Therefore, we propose a framework called C2C that classifies the critical tokens to facilitate their protection in a Transformer-based inference system with a small overhead. Using BERT with the GLUE datasets, critical embedding vectors take up only 13.8% of the embedding table. Compromising critical embedding vectors can reduce accuracy by up to 44.8% even if other parameters are not corrupted.
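As a hedged illustration of the classification idea (not the C2C framework itself), the sketch below ranks embedding rows by a simple saliency proxy, token frequency times gradient magnitude, and flags the smallest set of rows covering most of the total saliency as critical, so that only those rows would need protection.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = 1000
freq = rng.zipf(2.0, vocab).astype(float)          # token usage frequencies (Zipf-like)
grad_norm = rng.random(vocab)                      # per-row gradient magnitude (stand-in saliency)

saliency = freq * grad_norm
order = np.argsort(-saliency)
cum = np.cumsum(saliency[order]) / saliency.sum()
critical = order[: np.searchsorted(cum, 0.95) + 1]  # smallest prefix covering 95% of saliency
print(f"{len(critical)} of {vocab} embedding rows flagged as critical "
      f"({100 * len(critical) / vocab:.1f}% of the table)")
```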
11:51 CEST TS02.12 A DRAM-BASED PROCESSING-IN-MEMORY ACCELERATOR FOR PRIVACY-PROTECTING MACHINE LEARNING
Speaker and Author:
Bokyung Kim, Rutgers University, US
Abstract
The unprecedented success of deep neural networks (DNNs) has necessitated large-scale matrix processing. Correspondingly, machine learning (ML) accelerators have evolved for general matrix multiplication (GEMM), and the systolic array has been one of the most successful designs for GEMM. However, we observe that its efficiency is questionable for privacy-protecting DNNs, which are in increasing demand. In particular, differential privacy (DP) has been widely applied to DNN training to protect sensitive information, and its unique computation, the per-sample gradient norm, needs low-dimensional-tensor processing. Because of this mismatch, DP training shows poor efficiency on matrix-tailored systolic accelerators, repeatedly under-utilizing the array and incurring redundant data transfers. Departing from the GEMM-optimized systolic architecture, this work proposes a vector-processing-oriented DRAM processing-in-memory (PIM) accelerator, DPIMA, for DP training. Leveraging the advantages of DRAM and PIM, we offer a novel micro-architecture with a full operation adder tree (FOAT) and a systematic dataflow design. Our experiments with various learning models demonstrate that DPIMA achieves 13.8X and 123.8X improvements on average in performance and energy efficiency, respectively, over the systolic baseline.
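For context, the plain-NumPy sketch below (not the DPIMA hardware) shows the per-sample gradient-norm computation that makes DP training a poor match for GEMM-centric systolic arrays: each sample's gradient must be normed and clipped individually before aggregation, which is a vector-style workload rather than one large matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
per_sample_grads = rng.standard_normal((32, 4096))   # one flattened gradient per sample
clip_norm, sigma = 1.0, 0.8

norms = np.linalg.norm(per_sample_grads, axis=1)          # per-sample L2 norms
scale = np.minimum(1.0, clip_norm / norms)[:, None]       # per-sample clipping factors
clipped = per_sample_grads * scale
noisy_update = clipped.sum(axis=0) + rng.normal(0.0, sigma * clip_norm, 4096)
print(noisy_update.shape)    # (4096,)
```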

CFU Career Fair - University

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 12:00 CEST - 13:00 CEST


LK01 IEEE CEDA Lunchtime Panel: on the occasion of CEDA's 20th anniversary

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 13:15 CEST - 14:00 CEST


ASD02 ASD focus session: Cybersecurity Challenges of Autonomous Systems

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST

Organiser:
Sebastian Steinhorst, TU Munich, DE

With the recent dramatic increase in performance of artificial intelligence and related computing systems, together with advanced sensing, connectivity, and technological platforms, autonomous systems are poised to enter many application domains such as transportation and manufacturing. However, as autonomy increases, the risks of cybersecurity threats are equally rising, requiring the development of sophisticated methods on all layers of autonomous systems architectures. In this session, five experts from different areas of cybersecurity research in industry and academia will present challenges ranging from the physical layer to the system of systems layer of autonomous systems. Using the example of autonomous vehicles to highlight current developments, this session will discuss the efforts necessary to achieve secure and safe autonomous systems. The session will comprise individual 10-minute presentations of the five speakers, followed by a panel discussion that will involve the audience and further deepen the exchange.

Time Label Presentation Title
Authors
14:00 CEST ASD02.1 PHYSICAL LAYER INTEGRITY CHECKS
Presenter:
Mridula Singh, CISPA Helmholtz Center for Information Security, DE
Author:
Mridula Singh, CISPA Helmholtz Center for Information Security, DE
Abstract
.
14:10 CEST ASD02.2 ETHERNETIFICATION OF CAN
Presenter:
Alexander Zeh, Infineon Technologies, DE
Author:
Alexander Zeh, Infineon Technologies, DE
Abstract
.
14:20 CEST ASD02.3 SELF-SOVEREIGN IDENTITIES FOR SOFTWARE-DEFINED VEHICLES
Presenter:
Christian Prehofer, fortiss GmbH, DE
Author:
Christian Prehofer, fortiss GmbH, DE
Abstract
.
14:30 CEST ASD02.4 CAN WE ACHIEVE ACCEPTABLE SECURITY FOR AUTONOMOUS SYSTEMS?
Presenter:
Mikael Asplund, Linköping University, SE
Author:
Mikael Asplund, Linköping University, SE
Abstract
.
14:40 CEST ASD02.5 MANAGING CYBERSECURITY IN THE AUTONOMOUS VEHICLE MOBILITY-AS-A-SERVICE SYSTEM-OF-SYSTEMS
Presenter:
Tobias Löhr, P3 automotive GmbH, DE
Author:
Tobias Löhr, P3 automotive GmbH, DE
Abstract
.
14:50 CEST ASD02.6 PANEL DISCUSSION
Presenter:
All the Panelists, DATE 2025, FR
Author:
All the Panelists, DATE 2025, FR
Abstract
.

BPA02 BPA Session 2

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST BPA02.1 QGDP: QUANTUM LEGALIZATION AND DETAILED PLACEMENT FOR SUPERCONDUCTING QUANTUM COMPUTERS
Speaker:
Junyao Zhang, Duke University, US
Authors:
Junyao Zhang1, Guanglei Zhou1, Feng Cheng1, Jonathan Ku1, Qi Ding2, Jiaqi Gu3, Hanrui Wang4, Hai (Helen) Li1 and Yiran Chen1
1Duke University, US; 2Massachusetts Institute of Technology, US; 3Arizona State University, US; 4University of California, Los Angeles, US
Abstract
Quantum computers (QCs) are currently limited by qubit numbers. A major challenge in scaling these systems is crosstalk, which arises from unwanted interactions among neighboring components such as qubits and resonators. An innovative placement strategy tailored for superconducting QCs can systematically address crosstalk within limited substrate areas. Legalization is a crucial stage in the placement process, refining post-global-placement configurations to satisfy design constraints and enhance layout quality. However, existing legalizers do not support legalizing quantum placements. We aim to address this gap with qGDP, developed to meticulously legalize quantum components by adhering to quantum spatial constraints and reducing resonator crossings to alleviate various crosstalk effects. Our results indicate that qGDP effectively legalizes and fine-tunes the layout, addressing the quantum-specific spatial constraints inherent in various device topologies. Evaluated on diverse benchmarks, qGDP consistently outperforms state-of-the-art legalization engines, delivering substantial improvements in fidelity and reductions in spatial violations, with average gains of 34.4x and 16.9x, respectively.
14:20 CEST BPA02.2 RVEBS: EVENT-BASED SAMPLING ON RISC-V
Speaker:
Tiago Rocha, INESC-ID, Instituto Superior Técnico, University of Lisbon, PT
Authors:
Tiago Rocha1, Nuno Neves2, Nuno Roma2, Pedro Tomás3 and Leonel Sousa4
1INESC-ID, PT; 2INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, PT; 3INESC-ID, Instituto Superior Técnico, PT; 4INESC-ID | Universidade de Lisboa, PT
Abstract
As the RISC-V ISA continues to gain traction for both embedded and high-performance computing, the demand for advanced monitoring tools has become critical for fine-tuning application performance. Current RISC-V hardware performance monitors already provide basic event counting but lack sophisticated features like event-based sampling, which are available in more established architectures such as x86 and ARM. This paper presents the first RISC-V Event-Based Sampling (RVEBS) system for comprehensive performance monitoring and application profiling. The proposed system builds upon existing RISC-V specifications, incorporating the necessary modifications to enable the desired functionality. It also presents an OpenSBI extension to provide privileged software access to newly implemented control status registers that manage the sampling process. An implementation use case based on an OpenPiton processor featuring a CVA6 core in 28nm CMOS technology is presented. The results indicate that the proposed scheme is lightweight, highly accurate, and does not impact the processor's critical path while maintaining minimal impact on overall application performance.
14:40 CEST BPA02.3 XRAY: DETECTING AND EXPLOITING VULNERABILITIES IN ARM AXI INTERCONNECTS
Speaker:
Melisande Zonta, ETH Zurich, CH
Authors:
Melisande Zonta, Nora Hinderling and Shweta Shinde, ETH Zurich, CH
Abstract
The Arm AMBA Advanced eXtensible Interface (AXI) interconnect is a critical IP in FPGA-based designs. While AXI and interconnect designs are primarily optimized for performance, their security requires closer investigation—any bugs in these components can potentially compromise critical IPs like processing systems and memory. To this end, Xray systematically analyzes AXI interconnects. Specifically, it treats the AXI interconnect as a transaction processing block that is expected to adhere to certain properties (e.g., bus and data isolation, progress). Then, Xray employs a traffic generator that creates transaction workloads with the aim of triggering violations in the AXI interconnects. As the last piece of the puzzle, Xray wrappers automatically flag transaction traces as either compliant, errors, or warnings. Put together, Xray comprises 13 properties, has been tested on 7 interconnects, and identifies 41 violations corresponding to 41 vulnerabilities. When compared to existing approaches such as verification IPs (VIPs) and protocol checkers from commercial tools, Xray identifies 19 known and 22 new violations. We show the security impact of Xray by sampling 5 Xray violations to construct 3 proof-of-concept exploits on realistic scenarios deployed on FPGA to leak intermediate data, drop transactions, and corrupt memory.
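Xray's 13 properties are not listed in the abstract; the toy checker below (with a hypothetical trace format) conveys the approach of treating the interconnect as a transaction processor and flagging traces that violate an expected invariant, here a simple data-isolation property.

```python
def check_isolation(trace):
    """Flag read responses that return data not written at the addressed location
    or that are delivered to a master other than the requester (toy invariant)."""
    mem, pending, violations = {}, {}, []
    for ev in trace:
        if ev["type"] == "write":
            mem[ev["addr"]] = ev["data"]
        elif ev["type"] == "read_req":
            pending[ev["id"]] = (ev["master"], ev["addr"])
        elif ev["type"] == "read_resp":
            master, addr = pending.pop(ev["id"])
            if ev["master"] != master or ev["data"] != mem.get(addr):
                violations.append(ev)
    return violations

trace = [
    {"type": "write", "master": "M0", "addr": 0x10, "data": 0xAA},
    {"type": "write", "master": "M1", "addr": 0x20, "data": 0xBB},
    {"type": "read_req", "master": "M0", "addr": 0x10, "id": 7},
    # A buggy interconnect returns M1's data to M0:
    {"type": "read_resp", "master": "M0", "data": 0xBB, "id": 7},
]
print(check_isolation(trace))   # the last event is reported as a violation
```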

FS08 Focus session - The European Chips Act: Ready to Take-Off

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST

Session chair:
Anton Klotz, Fraunhofer, DE

Session co-chair:
Pascal Vivet, CEA, FR

The EU Chips Act is the biggest EU initiative to support the European microelectronics industry. After it entered into force on 21 September 2023, the first calls were issued in 2024. It is time to take a look at the progress that has been made in the past two years and at what lies ahead in 2025 and the following years. Our panelists represent various activities of the EU Chips Act: the head of the Chips JU and representatives of the pilot lines and the Virtual Design Platform initiative. After impulse presentations, there will be a panel discussion, where the panelists will answer questions from the audience on the EU Chips Act.

Participants:
Jari Kinaret, CHIPS JU, BE
Olivier Thomas, CEA, FR
Inge Asselberghs, IMEC, BE
Amelie Hagelauer, Fraunhofer, DE
Helio Fernandez Tellez, IMEC, BE


LKS02 Later … with the keynote speakers

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST


TS03 Embedded software architecture, compilers and tool chains

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS03.1 MPFS: A SCALABLE USER-SPACE PERSISTENT MEMORY FILE SYSTEM FOR MULTIPLE PROCESSES
Speaker:
Bo Ding, Huazhong University of Science and Technology, CN
Authors:
Bo Ding, Wei Tong, Yu Hua, Yuchong Hu, Zhangyu Chen, Xueliang Wei, Qiankun Liu, Dong Huang and Dan Feng, Huazhong University of Science and Technology, CN
Abstract
Persistent memory (PM) leveraging memory-mapped I/O (MMIO) delivers superior I/O performance, leading to the development of user-space PM file systems based on MMIO. While effective in single-process scenarios, these systems encounter challenges in multi-process environments, such as performance degradation due to repeated page faults and cross-process synchronizations, as well as a large memory footprint from duplicated paging structures. To address these problems, we propose a Multi-process PM File System (MPFS). MPFS builds a shareable page table and shares it among processes, avoiding building duplicate paging structures for distinct processes, thereby significantly reducing the software overhead and memory footprint caused by repeated page faults. MPFS further proposes a PGD-aligned (512GB) mapping method to accelerate page table sharing. Furthermore, MPFS provides a cross-process memory protection mechanism based on the PGD-aligned mapping, ensuring multi-process data reliability with negligible overheads. The experimental results show that MPFS outperforms existing user-space PM file systems by 1560% in multi-process scenarios.
14:05 CEST TS03.2 EILID: EXECUTION INTEGRITY FOR LOW-END IOT DEVICES
Speaker:
Youngil Kim, University of California, Irvine, US
Authors:
Sashidhar Jakkamsetti1, Youngil Kim2, Andrew Searles2 and Gene Tsudik2
1Bosch Research, US; 2University of California, Irvine, US
Abstract
Prior research yielded many techniques to mitigate software compromise for low-end Internet of Things (IoT) devices. Some of them detect software modifications via remote attestation and similar services, while others preventatively ensure software (static) integrity. However, achieving run-time (dynamic) security, e.g., control-flow integrity (CFI), remains a challenge. Control-flow attestation (CFA) is one approach that minimizes the burden on devices. However, CFA is not a real-time countermeasure against run-time attacks since it requires communication with a verifying entity. This poses significant risks if safety- or time-critical tasks have memory vulnerabilities. To address this issue, we construct EILID – a hybrid architecture that ensures software execution integrity by actively monitoring control-flow violations on low-end devices. EILID is built atop CASU, a prevention-based (i.e., active) hybrid Root-of-Trust (RoT) that guarantees software immutability. EILID achieves fine-grained backward-edge and function-level forward-edge CFI via semi-automatic code instrumentation and a secure shadow stack.
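As a plain-software illustration of the backward-edge protection that EILID enforces through hardware support and code instrumentation (a sketch, not the paper's mechanism), a shadow stack keeps a protected copy of return addresses and checks every return against it.

```python
class ShadowStack:
    """Toy backward-edge CFI monitor: returns must match the protected copy of the call stack."""
    def __init__(self):
        self._stack = []

    def on_call(self, return_addr: int) -> None:
        self._stack.append(return_addr)

    def on_return(self, target_addr: int) -> None:
        expected = self._stack.pop()
        if target_addr != expected:
            raise RuntimeError(f"CFI violation: return to {target_addr:#x}, expected {expected:#x}")

ss = ShadowStack()
ss.on_call(0x4000)
ss.on_return(0x4000)      # legitimate return: accepted
ss.on_call(0x4010)
try:
    ss.on_return(0x8000)  # corrupted return address (e.g., stack smashing)
except RuntimeError as err:
    print(err)
```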
14:10 CEST TS03.3 DANCER: DYNAMIC COMPRESSION AND QUANTIZATION ARCHITECTURE FOR DEEP GRAPH CONVOLUTIONAL NETWORK
Speaker:
Yi Wang, Shenzhen University, CN
Authors:
Yunhao Dong, Zhaoyu Zhong, Yi Wang, Chenlin Ma and Tianyu Wang, Shenzhen University, CN
Abstract
Graph Convolutional Networks (GCNs) have been widely applied in fields such as social network analysis and recommendation systems. Recently, deep GCNs have emerged, enabling the exploration of deeper hidden information. Compared to traditional shallow GCNs, deep GCNs feature significantly more layers, leading to considerable computational and data movement challenges. Processing-In-Memory (PIM) offers a promising solution for efficiently handling GCNs by enabling near-data computation, thus reducing data transfer between processing units and memory. However, previous work mainly focused on shallow GCNs and has shown limited performance with deep GCNs. In this paper, we present Dancer, an innovative PIM-based GCN accelerator. Dancer optimizes data movement during the inference process, significantly improving efficiency and reducing energy consumption. Specifically, we introduce a novel compressed graph storage architecture and a dynamic quantization technique to minimize data transfers at each layer of the GCN. Additionally, through a detailed analysis of weight dynamics changes, we propose a sparsity propagation strategy to further alleviate the computational and data transfer burden between layers. Experimental results demonstrate that, compared to current state-of-the-art methods, Dancer achieves a 3.7× speedup and 7.6× higher energy efficiency, and reduces DRAM accesses by 9.6× on average.
14:15 CEST TS03.4 LOOPLYNX: A SCALABLE DATAFLOW ARCHITECTURE FOR EFFICIENT LLM INFERENCE
Speaker:
Jianing Zheng, Sun Yat-sen University, CN
Authors:
Jianing Zheng and Gang Chen, Sun Yat-sen University, CN
Abstract
In this paper, we propose LoopLynx, a scalable dataflow architecture for efficient LLM inference that optimizes FPGA usage through a hybrid spatial-temporal design. The design of LoopLynx incorporates a hybrid temporal-spatial architecture, where computationally intensive operators are implemented as large dataflow kernels. This achieves high throughput similar to a spatial architecture, while organizing and reusing these kernels in a temporal way enhances FPGA peak performance. Furthermore, to overcome the resource limitations of a single device, we provide a multi-FPGA distributed architecture that overlaps and hides all data transfers so that the distributed accelerators are fully utilized. By doing so, LoopLynx can be effectively scaled to multiple devices to further explore model parallelism for large-scale LLM inference. Evaluation on the GPT-2 model demonstrates that LoopLynx achieves performance comparable to state-of-the-art single-FPGA accelerators. In addition, compared to an Nvidia A100, our accelerator with a dual-FPGA configuration delivers a 2.52x speed-up in inference latency while consuming only 48.1% of the energy.
14:20 CEST TS03.5 REMAPCOM: OPTIMIZING COMPACTION PERFORMANCE OF LSM TREES VIA DATA BLOCK REMAPPING IN SSDS
Speaker:
Yi Fan, Wuhan University of Technology, CN
Authors:
Yi Fan1, Yajuan Du1 and Sam H. Noh2
1Wuhan University of Technology, CN; 2UNIST, KR
Abstract
In LSM-based KV stores, typically deployed on systems with DRAM-SSD storage, compaction degrades write performance and SSD endurance due to significant write amplification. To address this issue, recent proposals have mostly focused on redesigning the structure of LSM trees. In this paper, we observe the prevalence of data blocks that are simply read and written back without being altered during the LSM tree compaction process, which we refer to as Unchanged Data Blocks (UDBs). These UDBs are a source of unnecessary write amplification, leading to performance degradation and a shortened SSD lifetime. To address this duplication issue, we propose a remapping-based compaction method, which we call RemapCom. RemapCom handles UDB identification and retention by designing a lightweight state machine to track the status of the KV items in each data block, as well as a UDB retention strategy that prevents data blocks from being split due to adjacent intersecting blocks. We implement a prototype of RemapCom on LevelDB by providing two primitives for the remapping. Compared to the state-of-the-art, evaluation results demonstrate that RemapCom can reduce the write amplification by up to 53%.
14:25 CEST TS03.6 A PRACTICAL LEARNING-BASED FTL FOR MEMORY-CONSTRAINED MOBILE FLASH STORAGE
Speaker:
Zelin Du, The Chinese University of Hong Kong, HK
Authors:
Zelin Du1, Kecheng Huang1, Tianyu Wang2, Xin Yao3, Renhai Chen4 and Zili Shao1
1The Chinese University of Hong Kong, HK; 2Shenzhen University, CN; 3Huawei Inc, HK; 4Huawei Inc, CN
Abstract
The rapidly growing mobile market is pushing flash storage manufacturers to expand capacity into the terabyte range. However, this presents a significant challenge for mobile storage management: more logical-to-physical page mappings need to be efficiently managed and cached while the available caching space is extremely limited. This motivates us to shift toward a new learning-based paradigm: rather than maintaining mappings for individual pages, the learning-based approach can represent mapping relationships for a set of continuous pages. However, to construct linear models, existing methods either consume the already-limited memory space or reuse flash garbage collection; they demonstrate poor model construction capabilities or significantly degrade flash performance, making them impractical for real-world use. In this paper, we propose LFTL, a practical, learning-based on-demand flash translation layer design for flash management in mobile devices. In contrast to prior work centered around gathering sufficient mappings for linear model construction, our key insight is that linear patterns can be extracted and refined by leveraging the orderly, LPA-aligned write stream typical of mobile devices. By doing this, highly accurate linear models can be constructed despite the cache limitations of mobile devices. We have implemented a fully functional prototype of LFTL based on FEMU. Our evaluation results show that LFTL offers preferable adaptability to memory-constrained storage devices compared to state-of-the-art learning-based approaches.
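A minimal, hypothetical illustration of the learned-mapping idea (not LFTL's actual construction): when logical pages are written sequentially and LPA-aligned, a single linear segment of the form PPA = slope * LPA + intercept can replace many individual mapping entries, so the cache holds a few model parameters instead of one entry per page.

```python
def fit_linear_segments(mappings, max_err=0):
    """Greedily cover (lpa, ppa) pairs, sorted by LPA, with exact unit-slope segments."""
    segments, i = [], 0
    pts = sorted(mappings)
    while i < len(pts):
        lpa0, ppa0 = pts[i]
        j = i + 1
        slope = 1  # sequential writes map consecutive LPAs to consecutive PPAs
        while j < len(pts):
            lpa, ppa = pts[j]
            if abs(ppa - (ppa0 + slope * (lpa - lpa0))) > max_err:
                break
            j += 1
        segments.append((lpa0, pts[j - 1][0], slope, ppa0))  # (lpa_start, lpa_end, slope, ppa_start)
        i = j
    return segments

# 1000 sequentially written pages plus one relocated page need only two segments.
mappings = [(lpa, 5000 + lpa) for lpa in range(1000)] + [(1000, 9000)]
print(fit_linear_segments(mappings))   # [(0, 999, 1, 5000), (1000, 1000, 1, 9000)]
```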
14:30 CEST TS03.7 CONZONE: A ZONED FLASH STORAGE EMULATOR FOR CONSUMER DEVICES
Speaker:
Dingcui Yu, East China Normal University, CN
Authors:
Dingcui Yu, Jialin Liu, Yumiao Zhao, Wentong Li, Ziang Huang, Zonghuan Yan, Mengyang Ma and Liang Shi, East China Normal University, CN
Abstract
Considering the potential benefits to lifespan and performance, zoned flash storage is expected to be incorporated into the next generation of consumer devices. However, due to the limited volatile cache and heterogeneous flash cells of consumer-grade flash storage, adopting a zone abstraction requires additional internal hardware design to maximize its benefits. To understand and efficiently improve the hardware design of consumer-grade zoned flash storage, we present ConZone—the first emulator tailored to the characteristics of consumer-grade zoned flash storage. Users can explore the internal architecture and management strategies of consumer-grade zoned flash storage and integrate the optimization with software. We validate the accuracy of ConZone by realizing a hardware architecture for consumer-grade zoned flash storage and comparing it with the state-of-the-art. We also present a case study on read performance with ConZone to explore the design of mapping mechanisms and cache management strategies.
14:35 CEST TS03.8 A HARDWARE-ASSISTED APPROACH FOR NON-INVASIVE AND FINE-GRAINED MEMORY POWER MANAGEMENT IN MCUS
Speaker:
Michael Kuhn, University of Tübingen, DE
Authors:
Michael Kuhn, Patrick Schmid and Oliver Bringmann, University of Tübingen, DE
Abstract
The energy demand of embedded systems is crucial and typically dominated by the memory subsystem. Off-the-shelf MCU platforms usually offer a wide range of memory configurations in terms of overall memory size, which may differ in the number of memory banks provided. Split memory banks have the potential to optimize energy demand, but this potential often remains unused in available hardware due to a lack of power management support or the significant manual effort required to leverage the benefits of split-banked memory architectures. This paper proposes an approach to solve the challenge of integrating fine-grained power management support automatically, via a combined hardware/software solution for future off-the-shelf platforms. We present a method to efficiently search for an optimized code and data mapping onto the modules of split memory banks to maximize the idle times of all memory modules. To non-invasively put memory modules into sleep mode, a PC-driven power management controller (PMC) autonomously triggers transitions between power modes during embedded software execution. The evaluation of our optimization flow demonstrates that memory mappings can be explored in seconds, including the generation of the necessary PMC configuration and linker scripts. The application of PC-driven power management enables active memory modules to remain in light sleep mode for approximately 13% to 86% of the execution time, depending on the workload and memory configuration. This results in overall power savings of up to 24% in the memory banks, in terms of static and dynamic power.
14:40 CEST TS03.9 TKD: AN EFFICIENT DEEP LEARNING COMPILER WITH CROSS-DEVICE KNOWLEDGE DISTILLATION
Speaker:
Chaoyao Shen, Southeast University, CN
Authors:
Yiming Ma, Chaoyao Shen, Linfeng Jiang, Tao Xu and Meng Zhang, Southeast University, CN
Abstract
Generating high-performance tensor programs on resource-constrained devices is challenging for current Deep Learning (DL) compilers that use learning-based cost models to predict the performance of tensor programs. Due to the inability of cost models to leverage cross-device information, it is extremely time-consuming to collect data and train a new cost model. To address this problem, this paper proposes TKD, a novel DL compiler that can be efficiently adapted to devices that are resource-constrained. TKD reduces the time budget by over 11x through an adaptive tensor program filter that eliminates redundant and unimportant measurements of tensor programs. Furthermore, by refining the cost model architecture with a multi-head attention module and distilling transferable knowledge from source devices, TKD outperforms state-of-the-art methods in prediction accuracy, compilation time, and compilation quality. We conducted experiments on the edge GPU, NVIDIA Jetson TX2, and the results show that compared to TenSet and TLP, TKD reduces compilation time by 1.58x and 1.16x, while achieving 1.40x and 1.27x speedups of the tensor programs, respectively.
14:45 CEST TS03.10 DISPEED: DISTRIBUTING PACKET FLOW ANALYSES IN A SWARM OF HETEROGENEOUS EMBEDDED PLATFORMS
Speaker:
Louis Morge-Rollet, ENSTA Institut Polytechnique de Paris, FR
Authors:
Louis Morge-Rollet1, Camelia Slimani2, Laurent Lemarchand3, Frédéric Leroy4, Jalil Boukhobza5 and David Espes3
1ENSTA Bretagne, FR; 2ENSTA Bretagne, FR; 3University of Brest, FR; 4ENSTA Bretagne, FR; 5ENSTA Bretagne Lab-STICC, FR
Abstract
Security is a major challenge in swarms of drones. Network intrusion detection systems (IDS) are deployed to analyze and detect suspicious packet flows. Traditionally, they are implemented independently on each drone. However, due to the heterogeneity and resource limitations of drones, IDS algorithms can fall short of satisfying Quality of Service (QoS) metrics, such as latency and accuracy. We argue that a drone can profit from the swarm by delegating part of the analysis of its packet flows to neighbor drones that have more processing power to enforce security. In this paper, we propose two solving methods to distribute the packet flows to be analyzed among drones in a way that ensures they are processed with minimal communication overhead to limit the attack surface, while meeting the QoS metrics imposed by the drone mission. First, we formulate the distribution problem using both an Integer Linear Programming (ILP) model and a Maximum-Flow Minimum-Cost (MFMC) model. Furthermore, we propose two specific solving methods for the distribution problem: (1) a Greedy Heuristic (GH), a non-exact solving method with a small time overhead, and (2) an Adapted Edmonds-Karp (AEK) algorithm, an exact method with a higher time overhead. GH proved to be a very fast solution (up to more than 2000x faster than ILP with Branch and Bound), while the AEK solution finds the exact solution even when the problem is very difficult.
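The ILP and MFMC formulations are in the paper; the sketch below only conveys the flavor of the Greedy Heuristic (GH): each packet flow is assigned to the drone that can still accept it at the lowest communication cost, which is fast but not guaranteed optimal.

```python
def greedy_assign(flows, drones):
    """flows: list of (flow_id, load); drones: dict drone -> {'capacity', 'comm_cost'}.
    Assumes total capacity is sufficient for all flows."""
    remaining = {d: spec["capacity"] for d, spec in drones.items()}
    assignment = {}
    for flow_id, load in sorted(flows, key=lambda f: -f[1]):      # biggest flows first
        candidates = [d for d in drones if remaining[d] >= load]
        best = min(candidates, key=lambda d: drones[d]["comm_cost"])
        assignment[flow_id] = best
        remaining[best] -= load
    return assignment

drones = {"local": {"capacity": 3, "comm_cost": 0.0},
          "neighbor1": {"capacity": 8, "comm_cost": 1.0},
          "neighbor2": {"capacity": 5, "comm_cost": 2.0}}
flows = [("f1", 4), ("f2", 2), ("f3", 3), ("f4", 5)]
print(greedy_assign(flows, drones))
```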
14:50 CEST TS03.11 ONE GRAY CODE FITS ALL: OPTIMIZING ACCESS TIME WITH BI-DIRECTIONAL PROGRAMMING FOR QLC SSDS
Speaker:
Tianyu Wang, Shenzhen University, CN
Authors:
Shaoqi Li1, Tianyu Wang1, Yongbiao Zhu1, Chenlin Ma1, Yi Wang1, Zhaoyan Shen2 and Zili Shao3
1Shenzhen University, CN; 2Shandong University, CN; 3The Chinese University of Hong Kong, HK
Abstract
Gray code, a voltage-level-to-data-bit translation scheme, is widely used in QLC SSDs. However, it causes the four data bits in QLC to exhibit significantly different read and write performance with up to 8x latency variation, severely impacting the worst-case performance of QLC SSDs. This paper presents BDP, a novel Bi-Directional Programming scheme. Based on a fixed Gray code, BDP combines both the normal (forward) and reverse programming directions to enable runtime programming direction arbitration. Experimental results show that BDP can effectively improve the read and write performance of SSD compared to representative schemes.
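For readers unfamiliar with the underlying problem, the sketch below uses a binary-reflected Gray code (real QLC parts use vendor-specific codes) to show why the four bit-pages of a QLC cell need very different numbers of sensing levels, which is the source of the up-to-8x read-latency variation that BDP's runtime programming-direction arbitration works around.

```python
def gray(i: int) -> int:
    return i ^ (i >> 1)

levels = [gray(i) for i in range(16)]    # voltage level -> 4-bit pattern (one bit per page)
for bit in range(4):
    # A page needs one sensing level per adjacent level pair where its bit changes.
    reads = sum(((levels[v] ^ levels[v + 1]) >> bit) & 1 for v in range(15))
    print(f"bit-page {bit}: {reads} sensing level(s) needed")
# Prints 8, 4, 2, 1: an 8x spread between the slowest and fastest page,
# consistent with the latency variation described above.
```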

UF University Fair & Student Teams Fair

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST


W01 Eco-ES: Eco-design and circular economy of Electronic Systems

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 18:00 CEST


W04 5th Workshop on Open-Source Design Automation (OSDA 2025)

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 18:00 CEST


CFI-CP Career Fair - Industry: Company Presentations

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:15 CEST - 17:30 CEST


ASD03 ASD focus session: Dynamic, Multi-Agent Sensing-to-Action Loops in Distributed Autonomous Edge Computing Systems: Opportunities and Challenges

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:30 CEST - 18:00 CEST

Organisers:
Amit Ranjan Trivedi, University of Illinois at Chicago, US
Saibal Mukhopadhyay, Georgia Tech, US

Autonomous edge computing in robotics, smart cities, and autonomous vehicles depends on seamlessly integrating sensing, processing, and actuation for real-time decision-making in dynamic environments. At its core is the sensing-to-action loop, which continuously aligns sensor inputs with computational models to drive adaptive control. These loops enhance responsiveness by adapting to hyper-local conditions but face challenges like resource constraints, synchronization delays in multi-modal data fusion, and the risk of cascading errors. This focus session examines how proactive, context-aware sensing-to-action and action-to-sensing adaptations can improve efficiency by dynamically adjusting sensing and computation based on task demands, such as selectively sensing a small part of the environment and predicting the rest. Action-to-sensing pathways improve task relevance and resource use by guiding sensing through control actions but require robust monitoring to prevent cascading errors. Multi-agent sensing-action loops extend these benefits through coordinated sensing and actions, optimizing resources via collaboration. Additionally, neuromorphic computing, inspired by biological systems, enables spike-based, event-driven processing that conserves energy, reduces latency, and supports hierarchical control—making it well-suited for multi-agent optimization. Finally, the session highlights the importance of co-designing algorithms, hardware, and environmental dynamics to improve throughput, precision, and adaptability, ultimately advancing energy-efficient edge autonomy in complex environments.

Time Label Presentation Title
Authors
16:30 CEST ASD03.1 SPECULATIVE EDGE-CLOUD DECODING FOR FAST AND RELIABLE DECISION-MAKING IN AUTONOMOUS SYSTEMS
Presenter:
Priyadarshini Panda, Yale University, US
Author:
Priyadarshini Panda, Yale University, US
Abstract
.
16:40 CEST ASD03.2 FILLING IN THE SENSING BLANKS WITH GENERATIVE AI: ULTRA-FRUGAL LIDAR PERCEPTION USING MASKED AUTOENCODERS FOR AUTONOMOUS NAVIGATION
Presenter:
Amit Trivedi, University of Illinois at Chicago, US
Author:
Amit Trivedi, University of Illinois at Chicago, US
Abstract
.
16:50 CEST ASD03.3 ROBOKOOP: EFFICIENT VISUAL CONTROL REPRESENTATIONS FOR ROBOTICS VIA THE KOOPMAN OPERATOR
Presenter:
Saibal Mukhopadhyay, Georgia Tech, US
Author:
Saibal Mukhopadhyay, Georgia Tech, US
Abstract
.
17:00 CEST ASD03.4 NEUROMORPHIC NAVIGATION IN THE REAL WORLD: INTEGRATING REAL-TIME EVENT-BASED VISION WITH PHYSICS-GUIDED PLANNING
Presenter:
Kaushik Roy, Purdue University, US
Author:
Kaushik Roy, Purdue University, US
Abstract
.
17:10 CEST ASD03.5 PANEL DISCUSSION
Presenter:
All the Panelists, DATE 2025, FR
Author:
All the Panelists, DATE 2025, FR
Abstract
.

BPA03 BPA Session 3

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST BPA03.1 TYRCA: A RISC-V TIGHTLY-COUPLED ACCELERATOR FOR CODE-BASED CRYPTOGRAPHY
Speaker:
Alessandra Dolmeta, Politecnico di Torino, IT
Authors:
Alessandra Dolmeta1, Stefano Di Matteo2, Emanuele Valea3, Mikael Carmona4, Antoine Loiseau4, Maurizio Martina5 and Guido Masera5
1Politecnico di Torino, IT; 2CEA-Leti, CEA-List, FR; 3CEA-List, FR; 4CEA-Leti, FR; 5DET - Politecnico di Torino, IT
Abstract
Post-quantum cryptography (PQC) has garnered significant attention across various communities, particularly with the National Institute of Standards and Technology (NIST) advancing to the fourth round of PQC standardization. One of the leading candidates is Hamming Quasi-Cyclic (HQC), which received a significant update on February 23, 2024. This update, which introduces a classical dense-dense multiplication approach, has no dedicated hardware implementations yet. The innovative Core-V eXtension InterFace (CV-X-IF) is a communication interface for RISC-V processors that significantly facilitates the integration of new instructions into the Instruction Set Architecture (ISA) through tightly coupled accelerators. In this paper, we present a TightlY-coupled accelerator for RISC-V for Code-based cryptogrAphy (TYRCA), proposing the first fully tightly-coupled hardware implementation of the HQC-PQC algorithm, leveraging the CV-X-IF. The proposed architecture is implemented on the Xilinx Kintex-7 FPGA. Experimental results demonstrate that TYRCA reduces the execution time by 94% to 96% for HQC-128, HQC-192, and HQC-256, showcasing its potential for efficient HQC code-based cryptography.
16:50 CEST BPA03.2 A SOFT ERROR TOLERANT DUAL STORAGE MODE FLIP-FLOP FOR EFPGA CONFIGURATION HARDENING IN 22NM FINFET PROCESS
Speaker:
Prashanth Mohan, Carnegie Mellon University, US
Authors:
Prashanth Mohan1, Siddharth Das1, Oguz Aatli1, Josh Joffrion2 and Ken Mai1
1Carnegie Mellon University, US; 2Sandia National Laboratories, US
Abstract
We propose a soft error tolerant flip-flop (FF) design to protect configuration storage cells in standard cell-based embedded FPGA fabrics used in SoC designs. Traditional rad-hard FFs such as DICE and Triple Modular Redundant (TMR) use additional redundant storage nodes for soft error tolerance and hence incur high area overheads. Since the eFPGA configuration storage is static, the master latch of the FF is transparent and unused, except when a configuration is loaded. The proposed dual-storage-mode (DSM) FF reuses the master and slave latches as redundant storage along with a C-element for error correction. The DSM FF was fabricated on a 22nm FinFET process along with standard D-FF, pulse DICE FF, and TMR FF designs to evaluate soft error tolerance. The radiation test results show that the DSM FF can reduce the error cross section by more than three orders of magnitude (3735X) compared to the standard D-FF and two orders of magnitude (455X) compared to the pulse DICE FF with a comparable area. Additionally the DSM FF is ~42% smaller than the TMR FF with similar error cross section.
17:10 CEST BPA03.3 REBERT: LLM FOR GATE-LEVEL TO WORD-LEVEL REVERSE ENGINEERING
Speaker:
Azadeh Davoodi, University of Wisconsin Madison, US
Authors:
Lizi Zhang1, Azadeh Davoodi2 and Rasit Topaloglu3
1University of Wisconsin Madison, US; 2University of Wisconsin-Madison, US; 3Adeia, US
Abstract
In this paper, we introduce ReBERT, a specialized large language model (LLM) based on BERT, fine-tuned specifically for grouping bits into words within gate-level netlists. By treating the netlist as a form of language, we encode bits and their fan-in cones into sequences that capture structural dependencies. A novel contribution is augmenting BERT's embedding with a tree-based embedding strategy which mirrors the hierarchical nature of circuit designs in hardware. Leveraging the powerful representational learning capabilities of LLMs, we interpret hardware circuits at a higher level of abstraction. We evaluate ReBERT on various hardware designs, demonstrating that it significantly outperforms a state-of-the-art work based on partial structural matching in recovering word-level groupings. Our improvements range on average from 12.2% to 218.2%, depending on the degree to which the structural patterns are corrupted.

FS03 Focus session - Design Automation for Physical Computing Systems

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:30 CEST - 18:00 CEST

Session chair:
Antonino Tumeo, PNNL, US

Organiser:
Anup Das, Drexel University, US

Time Label Presentation Title
Authors
16:30 CEST FS03.1 ANALOG SYSTEM SYNTHESIS FOR FPAAS AND CUSTOM ANALOG IC DESIGN
Speaker:
Jennifer Hasler, Georgia Tech, US
Authors:
Jennifer Hasler, Afolabi Ige and Linhao Yang, Georgia Tech, US
Abstract
Synthesis tools can unlock the potential of analog architectures to achieve real-time computation, signal processing, inference and learning for low-SWaP systems on commercial timescales. We present a methodology and results towards system-level analog and mixed-signal synthesis for both FPAAs and custom analog IC design. Building on previous efforts on tools targeting large-scale Field Programmable Analog Arrays (FPAAs) enables tools capable of synthesizing new ICs. The IC synthesis is built upon our recent work on an analog & mixed-signal programmable CMOS standard cell library that has been demonstrated across a range of CMOS process nodes (e.g., 180nm, 130nm, 65nm, 28nm, and 16nm CMOS). This synthesis can be extended to synthesizing new configurable fabrics for a new IC and generating the resulting configuration files to target that fabric. The entire tool flow is being developed as an open-source tool that will be widely available. These approaches move analog and mixed-signal design towards structured Design Space Exploration (DSE), and create a significant need for rapid analog simulation.
16:53 CEST FS03.2 GAIN-BASED COMPUTING WITH COUPLED LIGHT AND MATTER
Presenter:
Natalia Berloff, University of Cambridge, GB
Author:
Natalia Berloff, University of Cambridge, GB
Abstract
Gain-based computing based on light-matter interactions is a novel approach to physics-based hardware and physics-inspired algorithms. In gain-based computing, the complex optimisation problems are encoded in the gain and loss rates of driven-dissipative systems. The system is driven through a symmetry-breaking transition on the changing loss landscape until a mode that minimises losses is selected, manifesting the optimal solution to the original problem. This process allows for solving important combinatorial optimisation problems via mapping to Ising, XY, and k-local Hamiltonians, emphasising the system's applicability across various physical platforms, including photonic, electronic, and atomic systems. Two primary directions have emerged for developing gain-based analogue hardware, each using distinct aspects of physics' role in computational processes. The first approach exploits natural evolution principles of physical systems influenced and driven by external parameters, with the challenge of establishing controllable couplings between 'spins'. Polariton condensates in inorganic and organic-inorganic halide perovskites, atoms in QEDs and degenerate laser systems exemplify this. Conversely, the second approach, represented by technologies like analogue interactive machines (AIM) and spatial photonic machines (SPIM), focuses on establishing couplings through processes like light propagation, optical modulation, and signal detection, thereby managing system dynamics through feedback loops. Both platforms' core is the critical process leading to the optimum solution. Despite advancements in the physical realisation of these concepts, critical questions remain about scalability, the influence of phase space structures on system performance, and identifying problems best suited for these unconventional computing architectures. We need to understand the dynamic behaviour of the systems during symmetry-breaking transitions, trajectory optimisation towards global minima, error probabilities, and the potential for dissipation and nonlinearities to rectify these errors, highlighting the pivotal role of theory in addressing these challenges. By comparing various experimental platforms, including polaritons, lasers, and cold atoms, we should emphasise and exploit the universal nature of these questions. My talk outlines a strategic plan to tackle these outstanding questions while discussing and contrasting different approaches
17:15 CEST FS03.3 CHEMCOMP: COMPILING AND COMPUTING WITH CHEMICAL REACTION NETWORKS
Speaker:
Antonino Tumeo, Pacific Northwest National Laboratory, US
Authors:
Nicolas Agostini, Connah Johnson, William Cannon and Antonino Tumeo, Pacific Northwest National Laboratory, US
Abstract
The exponential growth in computing demands driven by scientific computing, data analytics, and artificial intelligence is pushing conventional CMOS-based high-performance computing systems to their physical and energy efficiency limits. As we approach the era of post-exascale computing, disruptive approaches are necessary to overcome these barriers and achieve substantial gains in energy efficiency. Analog and hybrid digital-analog computing systems have emerged as promising alternatives, offering the potential for orders-of-magnitude improvements in efficiency. Among these, biochemical computing stands out as a novel paradigm capable of leveraging the natural efficiency of chemical reactions, which inherently solve optimization problems by converging to steady states. By scaling up reaction networks or reaction vessel sizes, biochemical systems present an opportunity to meet the high-performance demands of modern computing tasks. Despite their promise, significant theoretical and practical challenges remain, particularly in formulating and mapping computational problems to chemical reaction networks (CRNs) and designing viable biochemical computing devices. This paper addresses these challenges by introducing ChemComp, a comprehensive framework for chemical computation. The framework features an abstract chemical reaction dialect implemented as a multi-level intermediate representation (MLIR) compiler extension and provides a systematic approach to translating mathematical problems into CRNs. We demonstrate the potential of our framework through a case study emulating a simplified chemical reservoir computing device. This work establishes the foundational tools and methodologies necessary to harness the computational power of chemistry, paving the way for the development of energy-efficient, high-performance computing systems tailored to contemporary and future computational needs.
17:38 CEST FS03.4 EXPLORING DENDRITIC COMPUTATION IN BIO-INSPIRED ARCHITECTURES FOR DYNAMIC PROGRAMMING
Speaker and Author:
Anup Das, Drexel University, US
Abstract
Dynamic programming is a classical optimization technique that systematically decomposes a complex problem into simpler sub-problems to find an optimal solution. We explore the use of bio-inspired architectures to find the shortest path between two nodes in a graph using dynamic programming. We leverage dendritic computations, which are linear and non-linear mechanisms in neuronal dendrites that allow different computational primitives to be implemented. We exploit two key mechanisms: 1) a dendrite acts as a delay line to propagate an excitatory post-synaptic potential to the soma, and 2) a feedback mechanism from the soma into the dendrites controls this delay. Our key ideas are the following. First, we model each node of a graph as a leaky integrate-and-fire (LIF) neuron supporting the two dendritic mechanisms. We use a countdown counter to implement forward propagation of a delayed synaptic potential and eligibility-trace-based feedback to update the delay by incorporating the cost of edges in the graph. Next, we formulate dynamic programming in terms of the time to the first spike in neurons. We break down the shortest path problem into sub-problems of finding the earliest firing times of neurons and iteratively build the final solution from these smaller sub-problems by tracing backward. We implement this approach for several real-world graphs and show its scalability. We also show an early prototype on a Virtex UltraScale FPGA.
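A software analogue of the idea (not the dendritic or FPGA implementation): if each node fires at its earliest incoming spike time and forwards the spike after a delay equal to the edge cost, the firing time of every node equals its shortest-path distance, which is exactly the dynamic-programming recurrence realized with dendritic delay lines.

```python
import heapq

def time_to_first_spike(graph, src):
    """Event-driven simulation: each node 'fires' at its earliest spike arrival time.
    graph: dict node -> list of (neighbor, edge_cost_as_delay)."""
    fired, events = {}, [(0.0, src)]
    while events:
        t, node = heapq.heappop(events)
        if node in fired:                 # a neuron fires only once; earliest arrival wins
            continue
        fired[node] = t
        for nbr, delay in graph.get(node, []):
            heapq.heappush(events, (t + delay, nbr))
    return fired                          # firing time == shortest-path distance from src

graph = {"A": [("B", 2), ("C", 5)], "B": [("C", 1), ("D", 4)], "C": [("D", 1)], "D": []}
print(time_to_first_spike(graph, "A"))    # {'A': 0.0, 'B': 2.0, 'C': 3.0, 'D': 4.0}
```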

TS04 Emerging design technologies for future memories

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS04.1 GLEAM: GRAPH-BASED LEARNING THROUGH EFFICIENT AGGREGATION IN MEMORY
Speaker:
Ivris Raymond, University of Michigan, US
Authors:
Andrew McCrabb, Ivris Raymond and Valeria Bertacco, University of Michigan, US
Abstract
Graph Neural Networks (GNNs) have emerged as a powerful tool for analyzing relationship-based data, such as those found in social networks, logistics, weather forecasting, and other domains. Inference and training with GNN models execute slowly, bottlenecked by limited data bandwidths between memory and GPU hosts, as a result of the many irregular memory accesses inherent to GNN-based computation. To overcome these limitations, we present GLEAM, a Processing-in-Memory (PIM) hardware accelerator designed specifically for GNN-based training and inference. GLEAM units are placed per-bank and leverage the much larger, internal bandwidth of HBMs to handle GNNs' irregular memory accesses, significantly boosting performance and reducing the energy consumption entailed by the dominant activity of GNN-based computation: neighbor aggregation. Our evaluation of GLEAM demonstrates up to a 10x speedup for GNN inference over GPU baselines, alongside a significant reduction in energy usage.
16:35 CEST TS04.2 PFP: PARALLEL FLOATING-POINT VECTOR MULTIPLICATION ACCELERATION IN MAGIC RERAM
Speaker:
Wenqing Wang, National University of Defense Technology, CN
Authors:
Wenqing Wang, Ziming Chen, Quan Deng and Liang Fang, National University of Defense Technology, CN
Abstract
Emerging applications, e.g., machine learning, large language models (LLMs), and graphic processing, are rapidly developing and are both compute-intensive and memory-intensive. Computing in Memory (CIM) is a promising architecture that accelerates these applications by eliminating the data movement between memory and processing units. Memristor-aided logic (MAGIC) CIM achieves massive parallelism, flexible computing, and non-volatility. However, MAGIC ReRAM performs floating-point (FP) vector multiplication sequentially, which wastes parallel computing resources and is limited by the array size. To solve this issue, we propose a parallel floating-point vector multiplication accelerator in MAGIC ReRAM. We exploit three levels of parallelism during the calculation of FP vector multiplication, referred to as PFP. First, we leverage the parallelism of MAGIC ReRAM. Second, we bring forward the final exponent to make the exponent calculations parallel. Third, we decouple the calculation of exponent, mantissa, and sign, which allows parallel calculation across accumulation. The experimental results show that PFP achieves a performance speedup of 2.51× and 15% energy savings compared to AritPIM when performing FP32 vector multiplication with a vector length of 512.
16:40 CEST TS04.3 AN EDRAM DIGITAL IN-MEMORY NEURAL NETWORK ACCELERATOR FOR HIGH-THROUGHPUT AND EXTENDED DATA RETENTION TIME
Speaker:
Jehun Lee, Seoul National University, KR
Authors:
Inhwan Lee1, Jehun Lee2, Jaeyong Jang2 and Jae-Joon Kim2
1Pohang University of Science and Technology, KR; 2Seoul National University, KR
Abstract
Computing-in-Memory (CIM) optimizes multiply-and-accumulate (MAC) operations for energy-efficient acceleration of neural network models. While SRAM has been a popular choice for CIM designs due to its compatibility with logic processes, its large cell size restricts storage capacity for neural network parameters. Consequently, gain-cell eDRAM, featuring memory cells with only 2-4 transistors, has emerged as an alternative for CIM cells. While digital CIM (DCIM) structure has been actively adopted in SRAM-based CIMs for better accuracy and scalability than analog CIMs (ACIM), previous eDRAM-based CIMs still employed ACIM structure since the eDRAM CIM cells were not able to perform a complete digital logic operation. In this paper, we propose an eDRAM bit cell for more efficient DCIM operations using only 4 transistors. The proposed eDRAM DCIM structure also maintains consistent and accurate output values over time, improving retention times compared to previous eDRAM ACIM designs. We validate our approach by fabricating an eDRAM DCIM macro chip and conducting hardware validation experiments, measuring retention time and neural network accuracy. Experimental results show that the proposed eDRAM DCIM achieves 3× longer retention time than state-of-the-art eDRAM ACIM designs, along with higher throughput without accuracy loss.
16:45 CEST TS04.4 A TWO-LEVEL SLC CACHE HIERARCHY FOR HYBRID SSDS
Speaker:
Jun Li, Nanjing University of Posts and Telecommunications, CN
Authors:
Li Cai1, Zhibing Sha1, Jun Li2, Jiaojiao Wu1, Huanhuan Tian1, Zhigang Cai1 and Jianwei Liao1
1Southwest University, CN; 2Nanjing University of Posts and Telecommunications, CN
Abstract
Although high-density NAND flash memory, such as triple-level-cell (TLC) flash memory, can offer high density, its lower write performance and endurance compared to single-level-cell (SLC) flash memory are impediments to the proliferation of TLC products. To overcome such disadvantages of TLC flash memory, hybrid architectures, which integrate a portion of SLC chips and employ them as a write cache, are widely adopted in commercial solid-state disks (SSDs). However, it is challenging to optimize the SLC cache with respect to, for example, the granularity of cached data and cold/hot data separation. In this paper, we propose supporting a two-level hierarchy (i.e., L1 and L2) of SLC cache stores based on varying granularities of cached data. Moreover, we support the segmentation of the L1 and L2 caches in the SLC region in a dynamic manner, by considering the write size characteristics of user applications. The evaluation results show that our proposal can improve I/O performance by between 12.6% and 25.1%, in contrast to existing cache management schemes for SLC-TLC hybrid storage.
16:50 CEST TS04.5 MULTI-MODE BORDERGUARD CONTROLLERS FOR EFFICIENT ON-CHIP COMMUNICATION IN HETEROGENEOUS DIGITAL/ANALOG NEURAL PROCESSING UNITS
Speaker:
Hong Pang, ETH Zurich, CH
Authors:
Hong Pang1, Carmine Cappetta2, Riccardo Massa2, Athanasios Vasilopoulos3, Elena Ferro3, Gamze Islamoglu1, Angelo Garofalo4, Francesco Conti5, Luca Benini6, Irem Boybat3 and Thomas Boesch7
1ETH Zurich, CH; 2STMicroelectronics, IT; 3IBM Research Europe - Zurich, CH; 4University of Bologna, ETH Zurich, IT; 5Università di Bologna, IT; 6ETH Zurich, CH | Università di Bologna, IT; 7STMicroelectronics, CH
Abstract
Driven by the growing demand for data-intensive parallel computation, particularly for Matrix-Vector Multiplications (MVMs), and the pursuit of high energy efficiency, Analog In-Memory Computing (AIMC) has garnered significant attention. AIMC addresses the data movement bottleneck by performing MVMs directly within memory, significantly reducing latency and enhancing energy efficiency. Integrating AIMC with digital units for non-MVM operations yields heterogeneous Neural Processing Units (NPUs) that can be combined in a tiled architecture to deliver promising solutions for end-to-end AI inference. Besides powerful heterogeneous NPUs, an efficient on-chip communication infrastructure is also pivotal for inter-node data transmission and efficient AI model execution. This paper introduces the Borderguard Controller (BG-CTRL), a multi-mode, path-through routing controller designed to support three distinct operating modes—time-scheduling, data-driven, and time-sliced data-driven (TSDD)—each offering varying levels of routing flexibility and energy efficiency depending on the data flow patterns and AI model complexity. To demonstrate the design, BG-CTRLs are integrated into a 9-node system of heterogeneous NPUs, arranged in a 3x3 grid and connected using a 2D mesh topology. The system is synthesized using STM 28nm FD-SOI technology. Experimental results show that the BG-CTRL cluster achieves an aggregate throughput of 983 Gb/s, with an energy efficiency of up to 0.41 pJ/B/hop at 0.64 GHz, and a minimal area overhead of 204 kGE.
16:55 CEST TS04.6 MAPPING SPIKING NEURAL NETWORKS TO HETEROGENEOUS CROSSBAR ARCHITECTURES USING INTEGER LINEAR PROGRAMMING
Speaker:
Devin Pohl, Georgia Tech, US
Authors:
Devin Pohl1, Aaron Young2, Kazi Asifuzzaman2, Narasinga Miniskar2 and Jeffrey Vetter2
1Georgia Tech, US; 2Oak Ridge National Lab, US
Abstract
Advances in novel hardware devices and architectures allow Spiking Neural Network (SNN) evaluation using ultra-low power, mixed-signal, memristor crossbar arrays. As individual network sizes quickly scale beyond the dimensional capabilities of single crossbars, networks must be mapped onto multiple crossbars. Crossbar sizes within modern Memristor Crossbar Architectures (MCAs) are determined predominantly not by device technology but by network topology; using more, smaller crossbars consumes less area thanks to the high structural sparsity found in larger, brain-inspired SNNs. Motivated by continuing increases in SNN sparsity due to improvements in training methods, we propose utilizing heterogeneous crossbar sizes to further reduce area consumption. This approach was previously unachievable as prior compiler studies only explored solutions targeting homogeneous MCAs. Our work improves on the state-of-the-art by providing Integer Linear Programming (ILP) formulations supporting arbitrarily heterogeneous architectures. By modeling axonal interactions between neurons, our methods produce better mappings while removing inhibitive a priori knowledge requirements. We first show a 16.7–27.6% reduction in area consumption for square-crossbar homogeneous architectures. Then, we demonstrate 66.9–72.7% further reduction when using a reasonable configuration of heterogeneous crossbar dimensions. Next, we present a new optimization formulation capable of minimizing the number of inter-crossbar routes. When applied to solutions already near-optimal in area, an 11.9–26.4% routing reduction is observed without impacting area consumption. Finally, we present a profile-guided optimization capable of minimizing the number of runtime spikes between crossbars. Compared to the best-area-then-route optimized solutions, we observe a further 0.5–14.8% inter-crossbar spike reduction while requiring 1–3 orders of magnitude less solver time.
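To make the ILP formulation idea concrete, the following minimal sketch assigns neurons to heterogeneous crossbars with the open-source pulp solver; the neuron fan-ins, crossbar dimensions and area costs are invented example data, and the model is far simpler than the paper's formulations (axonal interactions and inter-crossbar routing are not captured).

# Toy ILP: map each neuron to exactly one crossbar, respect fan-in (rows)
# and neuron-count (columns) limits, and minimize the area of used crossbars.
import pulp

neurons = {0: 3, 1: 5, 2: 2, 3: 7}           # neuron -> required input rows (fan-in)
crossbars = {"A": (8, 4, 32.0),               # name -> (rows, columns, area cost)
             "B": (16, 8, 120.0),
             "C": (4, 4, 20.0)}

prob = pulp.LpProblem("snn_to_crossbars", pulp.LpMinimize)
x = {(n, c): pulp.LpVariable(f"x_{n}_{c}", cat="Binary")
     for n in neurons for c in crossbars}
used = {c: pulp.LpVariable(f"used_{c}", cat="Binary") for c in crossbars}

# Objective: total area of the crossbars that end up being used.
prob += pulp.lpSum(crossbars[c][2] * used[c] for c in crossbars)

for n in neurons:                             # each neuron mapped exactly once
    prob += pulp.lpSum(x[n, c] for c in crossbars) == 1
for c, (rows, cols, _) in crossbars.items():
    prob += pulp.lpSum(x[n, c] for n in neurons) <= cols * used[c]   # column budget
    for n, fanin in neurons.items():
        prob += fanin * x[n, c] <= rows       # fan-in must fit the crossbar's rows

prob.solve(pulp.PULP_CBC_CMD(msg=False))
mapping = {n: c for (n, c), var in x.items() if var.value() > 0.5}
print(mapping)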
17:00 CEST TS04.7 AN EFFICIENT ON-CHIP REFERENCE SEARCH AND OPTIMIZATION ALGORITHMS FOR VARIATION-TOLERANT STT-MRAM READ
Speaker:
Kiho Chung, Sungkyunkwan University, KR
Authors:
Kiho Chung, Youjin Choi, Donguk Seo and Yoonmyung Lee, Sungkyunkwan University, KR
Abstract
A novel reference search algorithm is proposed in this paper to significantly reduce the reference search time of embedded spin transfer torque magnetic random access memory (STT-MRAM). Unlike conventional methods that sequentially search reference levels with linearly increasing references, the proposed Dual Read Reference Search (DRRS) algorithm requires only two array read operations. By analyzing the statistical characteristics of the read data using a customized function, the optimal reference level can be quickly determined in a few steps. Consequently, the number of read operations required for a reference search is reduced, providing a substantial improvement in the reference search time. The DRRS algorithm can be operated on-chip, and its effectiveness was confirmed through simulations. The optimization speed was improved by 85% compared to conventional methods. Additionally, a Triple Read Reference Search (TRRS) algorithm is proposed to decrease the variation occurring across different cell arrays and to enhance optimization accuracy. STT-MRAM is composed of numerous cell arrays, where the cell distributions in each array exhibit different characteristics. The TRRS algorithm enhances optimization accuracy for variations occurring in each array, achieving over a 2x increase in accuracy compared to the DRRS algorithm. Furthermore, a Simultaneous Reference Search for P and AP (SRS) algorithm is proposed, which significantly reduces the search time by simultaneously optimizing Parallel (P) and Anti-parallel (AP) state reference cells. Lastly, regarding cell degradation after power-up, the proposed time-saving algorithms (DRRS, TRRS and SRS) enable prompt re-optimization in the event of errors caused by cell degradation and ensure regular optimization to maintain maximum read margin even before errors occur, thereby enhancing reliability.
17:05 CEST TS04.8 FDAIMC: A FULLY-DIFFERENTIAL ANALOG IN-MEMORY-COMPUTING FOR MAC IN MRAM WITH ACCURACY CALIBRATION UNDER PROCESS AND VOLTAGE VARIATION
Speaker:
Xiangyu Li, School of Microelectronics Science and Technology, Sun Yat-sen University, CN
Authors:
Xiangyu Li1, Weichong Chen1, Ruida Hong1, Jinghai Wang2, Ningyuan Yin1 and Zhiyi Yu1
1School of Microelectronics Science and Technology, Sun Yat-sen University, CN; 2Sun Yat-sen University, CN
Abstract
Analog in-memory-computing (AIMC) is adopted extensively in non-volatile memory for multibit multiply-and-accumulate (MAC) operation. However, the low on/off ratio of the magnetic tunnel junction (MTJ) impedes a high-performance AIMC macro based on spin transfer torque magnetic random access memory (STT-MRAM). Secondly, because of the uncertainty of a mixed-signal system under process and voltage variation, calibration support is indispensable. Moreover, the incompatibility between a nonlinear analog signal and a linear digital signal hinders accurate computation and calibration support. To overcome these challenges, this work proposes an STT-MRAM-AIMC macro featuring: 1) a 2-level-differential cell array and a linear computing scheme with calibration support in the analog domain; 2) an analog-digital-conversion (ADC) system, including a slew-rate-independent voltage-to-time converter (SRIVTC) scheme and a self-triggered time-to-MAC value converter (STTMC) scheme; 3) a compact layout design for high area efficiency. Finally, an average accuracy of 95.44% is obtained under the TT&0.9V corner. By using the calibration strategy, average accuracies of 97.8% and 88.6% are obtained under the FF&0.945V and SS&0.855V corners, respectively, an enhancement of over 30%. Furthermore, an area FoM 1.64–21.18 times better than the state of the art is obtained, along with an energy efficiency of 87.2–312.4 TOPS/W.
17:10 CEST TS04.9 ARBITER: ALLEVIATING CONCURRENT WRITE AMPLIFICATION IN PERSISTENT MEMORY
Speaker:
Bolun Zhu, Huazhong University of Science and Technology, CN
Authors:
Bolun Zhu and Yu Hua, Huazhong University of Science and Technology, CN
Abstract
Persistent memory (PM) bridges the gap between high performance and persistence, and has thus received significant research attention. The concurrency in PM is often constrained due to limited concurrent I/O bandwidth. The I/O requests from different threads are serialized and interleaved in the memory controller. Such concurrent interleaving unintentionally hurts the locality of PM's on-DIMM buffer (XPBuffer) and thus causes significant performance degradation. Existing systems either endure the performance degradation caused by concurrent interleaving or leverage dedicated background threads to asynchronously perform I/O to PM. Unlike conventional designs, we present a non-blocking synchronous I/O scheduling mechanism that achieves high performance and low I/O amplification. The key insight is that inserting a proper number of delays into I/O can mitigate the I/O amplification and improve the effective bandwidth. We periodically assess the system state and adaptively determine the number of delays to be inserted for each thread. Evaluation results show that our design can significantly alleviate the I/O amplification and improve application performance for concurrent applications.
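The feedback loop hinted at in the abstract (measure amplification, then throttle threads) can be pictured with the toy controller below; the amplification estimate, target, step size and API are invented for illustration and are not the paper's actual mechanism.

# Toy per-thread delay controller: if estimated write amplification exceeds a
# target, add a small delay before that thread's next I/O; otherwise shrink it.
import time

class DelayController:
    def __init__(self, target_amplification=1.2, step_us=2, max_delay_us=50):
        self.delay_us = {}                      # per-thread delay in microseconds
        self.target = target_amplification
        self.step = step_us
        self.max_delay = max_delay_us

    def update(self, thread_id, bytes_issued, bytes_written_to_media):
        amplification = bytes_written_to_media / max(bytes_issued, 1)
        cur = self.delay_us.get(thread_id, 0)
        if amplification > self.target:
            cur = min(cur + self.step, self.max_delay)   # slow the thread down
        else:
            cur = max(cur - self.step, 0)                # speed it back up
        self.delay_us[thread_id] = cur

    def before_io(self, thread_id):
        time.sleep(self.delay_us.get(thread_id, 0) / 1e6)

ctrl = DelayController()
ctrl.update(thread_id=1, bytes_issued=4096, bytes_written_to_media=12288)
ctrl.before_io(thread_id=1)   # pays a small delay because amplification is high
print(ctrl.delay_us)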
17:15 CEST TS04.10 TRACKSCORER: SKYRMION LOGIC-IN-MEMORY ACCELERATOR FOR TREE-BASED RANKING MODELS
Speaker:
Elijah Cishugi, University of Twente, NL
Authors:
Elijah Cishugi1, Sebastian Buschjäger2, Martijn Noorlander1, Marco Ottavi3 and Kuan-Hsun Chen1
1University of Twente, NL; 2The Lamarr Institute for Machine Learning and Artificial Intelligence and TU Dortmund University, DE; 3University of Rome Tor Vergata | University of Twente, IT
Abstract
Racetrack memories (RTMs) have been shown to have lower leakage power and higher density compared to traditional DRAM/SRAM technologies. However, their efficiency is often hindered by the need to shift the targeted data to access ports for read and write operations. Suitable mapping approaches are therefore essential to unleash their potential. In this work, we explore the mapping of the popular tree-based document ranking algorithm, Quickscorer, onto Skyrmion-based racetrack memories (SK-RTMs). Our approach leverages a Logic-in-Memory (LiM) accelerator, specifically designed to execute simple logic operations directly within SK-RTMs, enabling an efficient mapping of Quickscorer by exploiting its bitvector representation and interleaved traversal scheme of tree structures through bitwise logical operations. We present several mapping strategies, including one based on a quadratic assignment problem (QAP) optimization algorithm for optimal data placement of Quickscorer onto the racetracks. Our results demonstrate a significant reduction in read and write operations and, in certain cases, a decrease in the time spent shifting data during Quickscorer inference.
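For readers unfamiliar with the bitvector traversal that Quickscorer reduces tree inference to, the toy sketch below shows the bitwise core for one hand-built 4-leaf tree; it does not model the skyrmion racetrack mapping, shift operations, or the QAP-based placement.

# Quickscorer-style exit-leaf lookup: AND the masks of all nodes whose test is
# false for the document; the lowest surviving bit identifies the exit leaf.
def exit_leaf(num_leaves, false_node_masks):
    bv = (1 << num_leaves) - 1                  # all leaves start reachable
    for mask in false_node_masks:
        bv &= mask                              # a false test cuts off its left subtree
    return (bv & -bv).bit_length() - 1          # index of the lowest set bit

# Tiny 4-leaf tree: root n0 (left subtree {0,1}), n1 (left subtree {0}),
# n2 (left subtree {2}); each mask zeroes the leaves of its left subtree.
masks = {"n0": 0b1100, "n1": 0b1110, "n2": 0b1011}

# A document for which n0 and n2 evaluate false and n1 evaluates true:
leaf = exit_leaf(4, [masks["n0"], masks["n2"]])
print(leaf)   # 3 -> the traversal goes right at the root, then right again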
17:20 CEST TS04.11 EF-IMR: EMBEDDED FLASH WITH INTERLACED MAGNETIC RECORDING TECHNOLOGY
Speaker:
Chenlin Ma, Shenzhen University, CN
Authors:
Chenlin Ma, Xiaochuan Zheng, Kaoyi Sun, Tianyu Wang and Yi Wang, Shenzhen University, CN
Abstract
Interlaced Magnetic Recording (IMR), a technology that improves storage density through track overlap, introduces significant latency due to Read-Modify-Write (RMW) operations. Writing to overlapped tracks affects underlying tracks, requiring additional I/O operations to read, back up, and rewrite them, resulting in significant head movement latency. We propose EF-IMR, a new architecture that ensures crash consistency in IMR while minimizing RMW latency and head movement. EF-IMR reduces head movement during RMW operations and decreases redundant RMW operations. Evaluations under real-world, intensive I/O workloads show that EF-IMR reduces RMW latency by 20.11% and head movement latency by 89.37% compared to existing methods.

TS05 System-level design methodologies and high-level synthesis

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS05.1 IMPROVING LLM-BASED VERILOG CODE GENERATION WITH DATA AUGMENTATION AND RL
Speaker:
Kyungjun Min, Pohang University of Science and Technology, KR
Authors:
Kyungjun Min, Seonghyeon Park, Hyeonwoo Park, Jinoh Cho and Seokhyeong Kang, Pohang University of Science and Technology, KR
Abstract
Large language models (LLMs) have recently attracted significant attention for their potential in Verilog code generation. However, existing LLM-based methods face several challenges, including data scarcity and the high computational cost of generating prompts for fine-tuning. Motivated by these challenges, we explore methods to augment training datasets, develop more efficient and effective prompts for fine-tuning, and implement training methods incorporating electronic design automation (EDA) tools. Our proposed framework for fine-tuning LLMs for Verilog code generation includes (1) abstract syntax tree (AST)-based data augmentation, (2) output-relevant code masking, a prompt generation method based on the logical structure of Verilog code, and (3) reinforcement learning with tool feedback (RLTF), a fine-tuning method using EDA tool results. Experimental studies confirm that our framework significantly improves syntax and functional correctness, outperforming commercial and non-commercial models on open-source benchmarks.
16:35 CEST TS05.2 SPARDR: ACCELERATING UNSTRUCTURED SPARSE DNN INFERENCE VIA DATAFLOW OPTIMIZATION
Speaker:
Wei Wang, Beihang University, CN
Authors:
Wei Wang, Hongxu Jiang, Runhua Zhang, Yongxiang Cao and Yaochen Han, Beihang University, CN
Abstract
Unstructured sparsity is becoming a key dimension in exploring the inference efficiency of neural networks. However, its irregular data layout makes it difficult to match the parallel computing mode of hardware, resulting in low computational and memory access efficiency. We studied this issue and found that existing sparse acceleration libraries and compilers explore sparse matrix multiplication optimizations through the splitting and reconstruction of sparse patterns, ignoring the acceleration of sparse convolution operations centered on data streams and thus missing optimization opportunities for sparse operations. In this article, we propose SparDR, a general sparse convolution acceleration method centered around data streams. Through novel feature map data stream reconstruction and convolutional kernel data representation, redundant zero-value calculations are effectively avoided, addressing efficiency is improved, and memory overhead is reduced. SparDR is based on TVM and allows for automatic scheduling across different hardware configurations. Compared with five current mainstream methods on four types of hardware, inference latency is accelerated by 1.1-12x and memory usage decreases by 20%.
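The zero-skipping principle behind sparse acceleration can be illustrated with a plain CSR sparse matrix-vector product, shown below; this only demonstrates skipping redundant zero-value work and does not reproduce SparDR's feature-map stream reconstruction or TVM-based scheduling.

# Minimal CSR sketch: store only non-zero weights and multiply only those.
import numpy as np

def to_csr(dense):
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def csr_matvec(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        start, end = row_ptr[r], row_ptr[r + 1]
        y[r] = values[start:end] @ x[col_idx[start:end]]   # zeros never touched
    return y

w = np.array([[0.0, 2.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 3.0],
              [0.0, 0.0, 0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0, 4.0])
vals, cols, ptr = to_csr(w)
print(csr_matvec(vals, cols, ptr, x))   # matches w @ x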
16:40 CEST TS05.3 AN IMITATION AUGMENTED REINFORCEMENT LEARNING FRAMEWORK FOR CGRA DESIGN SPACE EXPLORATION
Speaker:
Liangji Wu, Southeast University, Nanjing, Jiangsu Province, CN
Authors:
Liangji Wu, Shuaibo Huang, Ziqi Wang, Shiyang Wu, Yang Chen, Hao Yan and Longxing Shi, Southeast University, CN
Abstract
Coarse-Grained Reconfigurable Arrays (CGRAs) are a promising architecture that warrants thorough design space exploration (DSE). However, traditional DSE methods for CGRAs often get trapped in local optima due to singularities, i.e., invalid design points caused by CGRA mapping failures. In this paper, we propose a singularity-aware framework based on the integration of reinforcement learning (RL) and imitation learning (IL) for DSE of CGRAs. Our approach learns from both valid and invalid points, substantially reducing the probability of sampling singularities and accelerating the escape from inefficient regions, ultimately achieving high-quality Pareto points. Experimental results demonstrate that our framework improves the hypervolume (HV) of the Pareto front by 23.56% compared to state-of-the-art methods, with a comparable time overhead.
16:45 CEST TS05.4 OPERATION DEPENDENCY GRAPH-BASED SCHEDULING FOR HIGH-LEVEL SYNTHESIS
Speaker:
Aoxiang Qin, Sun Yat-sen University, CN
Authors:
Aoxiang Qin1, Minghua Shen1 and Nong Xiao2
1Sun Yat-sen University, CN; 2The School of Computer, Sun Yat-sen University, Panyue, CN
Abstract
Scheduling determines the execution order and time of operations in a program. The order is related to operation dependencies, including data and resource dependencies. Data dependencies are intrinsic to programs, while resource dependencies are determined by scheduling methods. Existing scheduling methods lack an accurate and complete operation dependency graph (ODG), leading to poor performance. In this paper, we propose an ODG-based scheduling method for HLS with GNN and RL. We adopt a GNN to perceive accurate relations between operations, use these relations to guide an RL agent in building a complete ODG, and perform feedback-guided iterative scheduling with the graph to converge to a high-quality solution. Experiments show that our method reduces latency by 23.8% and 16.4% on average, compared with the latest GNN-based and RL-based methods, respectively.
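As background for the scheduling problem, the toy list scheduler below processes an operation dependency graph under a simple functional-unit limit; the paper's GNN-derived relations and RL-built ODG are not modeled, and the unit budget is an arbitrary example value.

# Resource-constrained list scheduling over a dependency graph: every
# operation takes one cycle, at most units_per_cycle operations start per cycle.
def list_schedule(deps, units_per_cycle=2):
    """deps: op -> set of ops it depends on. Returns op -> start cycle."""
    remaining = {op: set(d) for op, d in deps.items()}
    schedule, cycle = {}, 0
    while remaining:
        ready = [op for op, d in remaining.items() if not d]
        for op in sorted(ready)[:units_per_cycle]:         # resource constraint
            schedule[op] = cycle
            del remaining[op]
            for d in remaining.values():
                d.discard(op)                               # release dependents
        cycle += 1
    return schedule

# a and b feed c; c and d feed e.
deps = {"a": set(), "b": set(), "c": {"a", "b"}, "d": set(), "e": {"c", "d"}}
print(list_schedule(deps))   # {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 2}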
16:50 CEST TS05.5 LOCALITY-AWARE DATA PLACEMENT FOR NUMA ARCHITECTURES: DATA DECOUPLING AND ASYNCHRONOUS REPLICATION
Speaker:
Shuhan Bai, Huazhong University of Science and Technology, CN
Authors:
Shuhan Bai, Haowen Luo, Burong Dong, Jian Zhou and Fei Wu, Huazhong University of Science and Technology, CN
Abstract
Non-Uniform Memory Access (NUMA) architectures bring new opportunities and challenges to bridge the gap between computing power and memory performance. Their complex memory hierarchies feature non-uniform access performance, known as NUMA locality, indicating that data placement and access without NUMA-awareness significantly impact performance. Existing NUMA-aware solutions often prioritize fast local access but at the cost of heavy replication overhead, suffering a read-write performance tradeoff and limited scalability. To overcome these limitations, this paper presents Ladapa, a scalable and high-performance locality-aware data placement strategy. The key insight is decoupling data into metadata and data layers, allowing independent management with adaptive asynchronous replication for lower overhead. Additionally, Ladapa employs multi-level metadata management leveraging fast caches for efficient data location, further boosting performance. Experimental results show that Ladapa outperforms typical replication techniques by up to 27.37× in write performance and 1.63× in read performance.
16:55 CEST TS05.6 HAVEN: HALLUCINATION-MITIGATED LLM FOR VERILOG CODE GENERATION ALIGNED WITH HDL ENGINEERS
Speaker:
Yiyao Yang, Shanghai Jiao Tong University, CN
Authors:
Yiyao Yang1, Fu Teng2, Pengju Liu1, Mengnan Qi1, Chenyang Lv1, Ji Li3, Xuhong Zhang2 and Zhezhi He1
1Shanghai Jiao Tong University, CN; 2Zhejiang University, CN; 3Independent Researcher, CN
Abstract
Recently, the use of large language models (LLMs) for Verilog code generation has attracted great research interest to enable hardware design automation. However, previous works have shown a gap between the ability of LLMs and the practical demands of hardware description language (HDL) engineering. This gap includes differences in how engineers phrase questions and hallucinations in the generated code. To address these challenges, we introduce HaVen, a novel LLM framework designed to mitigate hallucinations and align Verilog code generation with the practices of HDL engineers. HaVen tackles hallucination issues by proposing a comprehensive taxonomy and employing a chain-of-thought (CoT) mechanism to translate symbolic modalities (e.g., truth tables, state diagrams, etc.) into accurate natural language descriptions. Furthermore, HaVen bridges this gap by using a data augmentation strategy that synthesizes high-quality instruction-code pairs matching real HDL engineering practices. Our experiments demonstrate that HaVen significantly improves the correctness of Verilog code generation, outperforming state-of-the-art LLM-based Verilog generation methods on the VerilogEval and RTLLM benchmarks. HaVen is publicly available at https://github.com/Intelligent-Computing-Research-Group/HaVen.
17:00 CEST TS05.7 ENABLING MEMORY-EFFICIENT ON-DEVICE LEARNING VIA DATASET CONDENSATION
Speaker:
Gelei Xu, University of Notre Dame, US
Authors:
Gelei Xu1, Ningzhi Tang1, Jun Xia1, Ruiyang Qin1, Wei Jin2 and Yiyu Shi1
1University of Notre Dame, US; 2Emory University, US
Abstract
Upon deployment to edge devices, it is often desirable for a model to further learn from streaming data to improve accuracy. However, learning from such data is challenging because it is typically unlabeled, non-independent and identically distributed (non-i.i.d), and only seen once, which can lead to potential catastrophic forgetting. A common strategy to mitigate this issue is to maintain a small data buffer on the edge device to select and retain the most representative data for rehearsal. However, the selection process leads to significant information loss since most data is either never stored or quickly discarded. This paper proposes a framework that addresses this issue by condensing incoming data into informative synthetic samples. Specifically, to effectively handle unlabeled incoming data, we propose a pseudo-labeling technique designed for on-device learning environments. We also develop a dataset condensation technique tailored for on-device learning scenarios, which is significantly faster compared to previous methods. To counteract the effects of noisy labels during the condensation process, we further utilize a feature discrimination objective to improve the purity of class data. Experimental results indicate substantial improvements over existing methods, especially under strict buffer limitations. For instance, with a buffer capacity of just one sample per class, our method achieves a 56.7% relative increase in accuracy compared to the best existing baseline on the CORe50 dataset.
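A heavily simplified sketch of the two ingredients mentioned above, pseudo-labeling of unlabeled streaming data and per-class condensation into a tiny buffer, is given below; the exponential moving average stands in for the paper's actual (much more elaborate) condensation procedure, and all sizes are arbitrary.

# Toy on-device buffer: label each incoming sample with the current model's
# argmax, then fold its features into one synthetic sample per class.
import numpy as np

class CondensedBuffer:
    def __init__(self, num_classes, feat_dim, momentum=0.9):
        self.synthetic = np.zeros((num_classes, feat_dim))   # one sample per class
        self.counts = np.zeros(num_classes, dtype=int)
        self.momentum = momentum

    def pseudo_label(self, model_logits):
        return int(np.argmax(model_logits))                  # confidence-free toy rule

    def condense(self, feature, pseudo_label):
        c = pseudo_label
        if self.counts[c] == 0:
            self.synthetic[c] = feature                      # first sample seeds the slot
        else:
            self.synthetic[c] = (self.momentum * self.synthetic[c]
                                 + (1 - self.momentum) * feature)
        self.counts[c] += 1

buf = CondensedBuffer(num_classes=3, feat_dim=4)
rng = np.random.default_rng(0)
for _ in range(20):
    feat = rng.normal(size=4)
    logits = rng.normal(size=3)                              # stand-in for model output
    buf.condense(feat, buf.pseudo_label(logits))
print(buf.synthetic)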
17:05 CEST TS05.8 TAICHI: EFFICIENT EXECUTION FOR MULTI-DNNS USING GRAPH-BASED SCHEDULING
Speaker:
Xilang Zhou, Fudan University, CN
Authors:
Xilang Zhou, Haodong Lu, Tianchen Wang, Zhuoheng Wan, Jianli Chen, Jun Yu and Kun Wang, Fudan University, CN
Abstract
Deep Neural Networks (DNNs) are increasingly used for complex tasks (e.g., AR/VR) by constructing different types of DNNs into a workflow. However, efficient frameworks are lacking for accelerating these applications, which have complex connectivity and require real-time processing. We introduce ReFA, an FPGA-based co-design framework for the acceleration of real-time multi-DNN workloads. Specifically, on the hardware level, we develop an FPGA-based multi-core accelerator, which adopts a unified template for various DNN models and supports depth-first execution to reduce data movement. On the software level, we design a lightweight scheduler based on a genetic algorithm, which can rapidly find high-quality scheduling strategies in a huge solution space. Our evaluations show that ReFA deployed on a Xilinx Alveo U200 achieves up to 10.1-37.3× and 1.4-1.5× reduction in job completion time (JCT) compared with CPU and GPU, respectively. Furthermore, ReFA gains 6.1-9.3×, 7.9×, 5.6-7.1×, and 2.4× reduction in energy-delay product compared with GPU, Planaria, Herald and H3M, respectively.
17:10 CEST TS05.9 VTOT: AUTOMATIC VERILOG GENERATION VIA LLMS WITH TREE OF THOUGHTS PROMPTING
Speaker:
Xiangyu Wang, National University of Defense Technology, CN
Authors:
Yingjie Zhou1, Renzhi Chen2, Xinyu Li1, Jingkai Wang1, Zhigang Fang1, Bowei Wang1, Wenqiang Bai1, Qilin Cao1 and Lei Wang3
1National University of Defense Technology, CN; 2Qiyuan Laboratory, CN; 3Academy of Military Sciences, CN
Abstract
The automatic generation of Verilog code using Large Language Models (LLMs) presents a compelling solution for enhancing the efficiency of the hardware design flow. However, the state-of-the-art performance of LLMs in Verilog generation remains limited when compared to programming languages such as Python. Previous research on Chain of Thought (CoT) prompting has demonstrated that incorporating intermediate reasoning steps can significantly improve the performance of LLMs in code generation. In this paper, we propose the Verilog Tree of Thoughts (VToT) method. This structured prompting technique addresses the abstraction gap between Verilog and CoT by embedding hierarchical design constraints within the prompt. Experimental results on the VerilogEval and RTLLM benchmarks demonstrate that VToT prompting enhances both the syntactic and functional correctness of the generated code. Specifically, under the RTLLM benchmark, VToT achieved a correctness rate of 75.9% at pass@5, representing an improvement of 10.4%. Furthermore, in the VerilogEval benchmark, VToT achieved state-of-the-art performance with a correctness rate of 52.4% at pass@1 (an increase of 8.9%) and 65.4% at pass@5 (an increase of 9.6%).
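The search skeleton of tree-of-thoughts style prompting can be pictured with the toy loop below; llm() and score_candidate() are hypothetical placeholders (a real setup would query an actual LLM and a Verilog linter or testbench), and nothing here reproduces VToT's hierarchical design constraints.

# Generic expand-score-keep-best loop over "thoughts"; both helpers are stubs.
def llm(prompt, n=2):
    # Placeholder: pretend the model proposes n refinements of the prompt.
    return [f"{prompt} -> candidate {i}" for i in range(n)]

def score_candidate(candidate):
    # Placeholder: a real scorer might run a Verilog linter or a testbench.
    return len(candidate) % 7

def tree_of_thoughts(spec, depth=3, beam=2):
    frontier = [spec]                       # partial design "thoughts"
    for _ in range(depth):
        expansions = [c for thought in frontier for c in llm(thought)]
        expansions.sort(key=score_candidate, reverse=True)
        frontier = expansions[:beam]        # keep only the best-scoring thoughts
    return frontier[0]

print(tree_of_thoughts("8-bit counter with synchronous reset"))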
17:11 CEST TS05.10 SIGNAL PREDICTION FOR DIGITAL CIRCUITS BY SIGMOIDAL APPROXIMATIONS USING NEURAL NETWORKS
Speaker:
Josef Salzmann, TU Wien, AT
Authors:
Josef Salzmann and Ulrich Schmid, TU Wien, AT
Abstract
Investigating the temporal behavior of digital circuits is a crucial step in system design, usually done via analog or digital simulation. Analog simulators like SPICE iteratively solve the differential equations characterizing the circuits' components numerically. Although unrivaled in accuracy, this is only feasible for small designs, due to the high computational effort even for short signal traces. Digital simulators use digital abstractions for predicting the timing behavior of a circuit. We advocate a novel approach, which generalizes digital traces to traces consisting of sigmoids, each parameterized by threshold crossing time and slope. For a given gate, we use an artificial neural network for implementing the transfer function that predicts, for any trace of input sigmoids, the parameters of the generated output sigmoids. By means of a prototype simulator, which can handle circuits consisting of inverters and NOR gates, we demonstrate that our approach operates substantially faster than an analog simulator, while offering a much better accuracy than a digital simulator.
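The sigmoid parameterization described above, and the idea of a neural network acting as a gate transfer function, can be sketched as follows; the "gate" is a synthetic toy model (delayed crossing, damped slope) rather than data from a characterized inverter or NOR gate, and the network size is arbitrary.

# A sigmoid trace is fully described by its threshold-crossing time and slope;
# a small MLP is fitted to map input-sigmoid parameters to output-sigmoid ones.
import numpy as np
from sklearn.neural_network import MLPRegressor

def sigmoid_trace(t, t_cross, slope):
    return 1.0 / (1.0 + np.exp(-slope * (t - t_cross)))

# Synthetic characterization data: (input t_cross, input slope) -> output params.
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(0, 5, 500),         # input crossing times
                     rng.uniform(1, 10, 500)])       # input slopes
y = np.column_stack([X[:, 0] + 0.3 + 0.5 / X[:, 1],  # toy delay model
                     0.8 * X[:, 1]])                 # toy slope attenuation

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(X, y)

t = np.linspace(0, 8, 9)
t_out, k_out = model.predict([[2.0, 4.0]])[0]        # predicted output parameters
print(np.round(sigmoid_trace(t, t_out, k_out), 3))   # reconstructed output waveform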
17:12 CEST TS05.11 VERILUA: AN OPEN SOURCE VERSATILE FRAMEWORK FOR EFFICIENT HARDWARE VERIFICATION AND ANALYSIS USING LUAJIT
Speaker:
Chuyu Zheng, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, CN
Authors:
Ye Cai1, Chuyu Zheng1, Wei He2 and Dan Tang3
1Shenzhen University, CN; 2Beijing Institute of Open Source Chip, CN; 3Institute of Computing Technology, Chinese Academy of Sciences (ICT) / Beijing Institute of Open Source Chip, CN
Abstract
The growing complexity of hardware verification highlights limitations in existing frameworks, particularly regarding flexibility and reusability. Current methodologies often require multiple specialized environments for functional verification, waveform analysis, and simulation, leading to toolchain fragmentation and inefficient code reuse. This paper presents Verilua, a unified framework leveraging LuaJIT and the Verilog Procedural Interface (VPI), which integrates three core functionalities: Lua-based functional verification, a scripting engine for RTL simulation, and waveform analysis. By enabling complete code reuse through a unified Lua codebase, the framework achieves a 12× speedup in RTL simulation compared to cocotb and a 70× improvement in waveform analysis over state-of-the-art solutions. Through consolidating verification tasks into a single platform, Verilua enhances efficiency while reducing tool fragmentation and learning overhead, addressing critical challenges in modern hardware design.

CFI-SD Career Fair - Industry : Speed dating

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 17:30 CEST - 18:30 CEST


PhDF PhD forum

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 18:30 CEST - 20:00 CEST

Session chair:
Christian Pilato, Politecnico di Milano, IT

Session co-chair:
Dirk Stroobandt, Ghent University, BE

The PhD Forum is a great opportunity for PhD students to present their work to a broad audience in the system design and design automation community from both industry and academia, and to establish contacts for entering the job market. Representatives from industry and academia get a glance of state-of-the-art in system design and design automation. The PhD Forum is hosted by EDAA, ACM SIGDA, and IEEE CEDA.

Time Label Presentation Title
Authors
18:30 CEST PhDF.1 ADAPTIVE HARDWARE FOR ENERGY-EFFICIENT FPGA-BASED DATA CENTERS
Speaker and Author:
Mattia Tibaldi, Politecnico di Milano, IT
Abstract
Modern applications require the elaboration of massive amounts of data. Due to the computational power needed, such applications may execute in data centers that consume immense energy. In 2020, data centers contributed 2% of the world's carbon emissions, with an increasing trend. Google recently announced the development of small nuclear reactors to power larger data centers with zero carbon emissions. Although this is a possible direction in the development of sustainable data centers, the solution may not always be applicable, as several states impose restrictions on the use of nuclear energy. For this reason, the study of hardware and software solutions remains important, and designers must guarantee a high quality of result while efficiently managing the energy required by the computation to reduce costs and carbon production. Many data centers are moving towards heterogeneous architectures equipped with specialized hardware to achieve high performance and power savings. Through customization, these architectures can significantly reduce energy consumption, while hardware parallelism can optimize the execution time. However, such components have limited flexibility: once designed, they cannot execute the functionality differently, and their energy consumption is fixed by the implementation of the architecture. This research proposes implementing an adaptive system based on FPGAs to guarantee flexibility, develop different versions of a computation component, and select the version at run time. In this way, based on stimuli coming from the environment, such as the intensity of the incoming traffic or the data formats, it will be possible to use logic with different energy profiles. This approach allows us to design an accelerator with 25× the power efficiency of a CPU and a 40% reduction in carbon emissions.
18:30 CEST PhDF.2 SAFETY CONCEPT AND SIMULATION-BASED APPROVAL OF AN AUTOMATED DRIVING FUNCTION FOR THE TRANSVERSE GUIDANCE OF VEHICLES
Speaker and Author:
Marzana Khatun, University of Ulm, DE
Abstract
The growing interest in and demand for automated driving systems have turned self-driving vehicles from science fiction into practical reality. However, automated vehicles (AVs) face the critical challenge of outperforming human drivers and have yet to gain the public's trust. Establishing safety concepts in advance of the development phases is essential for the reliability of automated driving systems. Emerging safety concepts emphasize rigorous proofs, development processes and applicable methods to ensure the safety of such systems. These concepts take into account the inherent complexity of automated technologies and the need for continuous improvement, incorporating both new and refined technologies and methods. This work proposes a general safety concept through: (a) scenario-based extended Hazard Analysis and Risk Assessment (HARA) for the transverse guidance of a vehicle, used as a reference for describing safety-related threat scenarios for vehicle functions such as Over-The-Air (OTA) updates, (b) scenario reduction approaches to support simulation-based approval for collision detection use cases focusing on L3 or higher levels of automation (L4/L5), and (c) evaluation of management system interaction for autonomous vehicles.
18:30 CEST PhDF.3 DETECTION AND REPAIR OF DEFECTS IN RTL CODE USING STATIC ANALYSIS AND GENERATIVE AI
Presenter:
Baleegh Ahmad, New York University, US
Author:
Baleegh Ahmad, New York University, US
Abstract
Problems associated with hardware bugs have gained increasing importance over the past decade. In particular, detecting and fixing security bugs has been the focus of many academic and industrial efforts. It is crucial to detect defects in hardware as early as possible to reduce costs, effort and damage to reputation down the line. Existing techniques lack the breadth to detect a variety of defects and the scalability to apply generalizable solutions over many digital designs. Both these deficiencies can be addressed by developing solutions at an earlier stage of the system-on-chip (SoC) development life-cycle. This work provides strategies to apply bug detection and repair techniques without needing a full-fledged testing framework. These strategies are employed at the register-transfer level (RTL) by looking at the structure and elements of the code, information around the code such as specifications and comments, and more general guidelines for secure code such as the Common Weakness Enumerations (CWEs). One of the main research gaps is that existing solutions for bug detection and repair in RTL rely on design-specific information and techniques, which makes them not generalizable. Additionally, security verification is by its nature non-exhaustive: some vulnerabilities are only understood once they are exploited, so there is no way of knowing what to look for beforehand. Another limitation of current approaches is that the solutions do not have the ability to 'learn' from previous issues and solutions. Our work aims to address these limitations by i) improving generalizability by moving away from design-specific frameworks and implementing scanners that utilize a broad range of vulnerabilities, i.e., CWEs, ii) focusing on security-related bugs to produce security-aware linters and fixing security bugs using LLMs, and iii) using the ability of LLMs to detect and repair bugs at RTL, showing how generative AI-based tools can succeed by working out solutions from what they learned during training, fine-tuning and context-based learning.
18:30 CEST PhDF.4 LOW-POWER TIME-DOMAIN HARDWARE ACCELERATOR FOR EDGE COMPUTING
Speaker:
Jie Lou, RWTH Aachen University, DE
Authors:
Jie Lou and Tobias Gemmeke, RWTH Aachen University, DE
Abstract
Efficient computing is becoming increasingly crucial for energy-constrained edge devices. With the rapid adoption of artificial neural networks (ANNs), reducing energy consumption has emerged as a pressing research challenge to enable effective edge computing. Time-domain (TD) computing has attracted attention for its inherent analog signal processing properties and compatibility with digital circuits. Unfortunately, it remains unclear which scenarios are best suited for computation in the time domain. This thesis focuses on developing hardware accelerators for time-domain computing, analyzing suitable application domains, and identifying the principles and constraints that should be considered during ASIC implementation. Specifically, we design, tape out and measure both standard and custom cell-based time-domain compute-in-memory (TDCIM) accelerators for binary neural networks (BNNs) and convolutional neural networks (CNNs), as well as a standard cell-based TD decoder for low-density parity-check (LDPC) codes, using 22nm FDSOI technology to validate the performance of time-domain computing. Besides, we also develop a software simulation framework that accounts for hardware TD noise in neuromorphic and LDPC applications.
18:30 CEST PhDF.5 SIMULATION TECHNIQUES FOR RAPID SOFTWARE DEVELOPMENT AND VALIDATION
Speaker:
Mohammadreza Amel Solouki, Politecnico di Torino, IT
Authors:
Mohammadreza Amel Solouki and Massimo Violante, Politecnico di Torino, IT
Abstract
Ensuring reliability under Random Hardware Failures (RHFs) in safety-critical embedded systems requires robust fault tolerance measures. My research proposes innovative methods for enhancing fault detection and mitigation through Control Flow Checking (CFC) and Software-Implemented Hardware Fault Tolerance (SIHFT) techniques. Hardening strategies are often applied in embedded systems to mitigate RHFs, either by using specialized hardware or employing SIHFT methods. However, most existing approaches in the literature target soft errors and are implemented in low-level languages such as Assembly. This complicates compliance with functional safety standards, which increasingly advocate for high-level programming languages like C. Addressing these challenges, my research focuses on integrating SIHFT methods directly into high-level programming workflows to streamline fault detection and mitigation processes while adhering to industry standards like ISO 26262. This study tackles the practical challenges of implementing fault tolerance in embedded systems, bridging the gap between theoretical models and real-world applications. By leveraging high-level programming languages and adhering to international safety standards like ISO 26262, this work advances the state of the art in embedded system reliability and lays a foundation for future developments in fault-tolerant computing. High-level language implementations simplify adherence to ISO 26262 by enabling better traceability, easier maintenance, and compliance with mandated software development workflows.
18:30 CEST PhDF.6 SOFTWARE AND HARDWARE CO-OPTIMIZATION FOR GRAPH NEURAL NETWORKS ON FPGA
Presenter:
Ruiqi Chen, Vrije Universiteit Brussel, BE
Authors:
Ruiqi Chen1, Kun Wang2 and Bruno da Silva1
1Vrije Universiteit Brussel, BE; 2Fudan University,
Abstract
Deep Neural Networks (DNNs) are proliferating in numerous AI applications, thanks to their high accuracy. For instance, Convolution Neural Networks (CNNs), one variety of DNNs, are used in object detection for autonomous driving and have reached or exceeded the performance of humans in some object detection problems. Commonly adopted CNNs such as ResNet and MobileNet are becoming deeper (more layers) while narrower (smaller feature maps) than early AlexNet and VGG. Nevertheless, due to the race for better accuracy, the scaling up of DNN models, especially Transformers (another variety of DNNs), to trillions of parameters and trillions of Multiply-Accumulate (MAC) operations, as in the case of GPT-4, during both training and inference, has made DNN models both data-intensive and compute-intensive, placing heavier demands on memory capacity for storing weights and on computation. This poses a significant challenge for the deployment of these models in an area-efficient and power-efficient manner. Given these challenges, model compression is a vital research topic to alleviate the crucial difficulties of memory capacity from the algorithmic perspective. Pruning, quantization, and entropy coding are three directions of model compression for DNNs. The effectiveness of pruning and quantization can be enhanced with entropy coding for further model compression. Entropy coding focuses on encoding the quantized values of weights or features in a more compact representation by utilizing the peaky distribution of the quantized values, to achieve a lower number of bits per variable without any accuracy loss. Currently employed Fixed-to-Variable (F2V) entropy coding schemes such as Huffman coding and Arithmetic coding are inefficient to decode on hardware platforms, suffering from a high decoding complexity of O(n · k), where n is the number of codewords (quantized values) and k is the reciprocal of the compression ratio.

18:30 CEST PhDF.7 SEMI-TENSOR PRODUCT OF MATRICES AND ITS APPLICATION IN LOGIC SYNTHESIS
Speaker:
Hongyang Pan, Fudan University, CN
Authors:
Hongyang Pan1, Zhufei Chu2 and Fan Yang1
1Fudan University, CN; 2Ningbo University, CN
Abstract
In recent years, a new theory called the semi-tensor product (STP) has emerged [8]. By studying the topological structure of Boolean networks, STP transforms logical dynamic systems into discrete dynamic systems, thereby realizing logical reasoning through matrix multiplication. The STP approach offers a promising way to address the current challenges of logic synthesis. By defining the logic matrix as a circuit primitive, STP converts the logic network into matrix multiplication while retaining the topological information between circuits. The STP method can represent any network, including technology-independent representations and technology-dependent ones (standard cells or LUTs), thereby unifying circuit representation.
18:30 CEST PhDF.8 EXPLORING LAYER-FUSED MAPPING OF DNNS ON HETEROGENEOUS DATAFLOW ACCELERATORS
Speaker:
Arne Symons, KU Leuven, BE
Authors:
Arne Symons1 and Marian Verhelst2
1MICAS, KU Leuven, BE; 2KU Leuven, BE
Abstract
The rapid advancements in deep neural networks (DNNs) have led to increased computational complexity, memory demands, and energy consumption, posing significant challenges for edge applications. Heterogeneous dataflow accelerators (HDAs), leveraging multi-core and chiplet-based architectures, offer specialized processing for diverse DNN workloads. However, traditional layer-by-layer scheduling strategies often result in high off-chip memory traffic and underutilized cores. This work introduces Stream, a design space exploration framework for optimizing layer-fused DNN mappings on HDAs. Layer fusion, or depth-first scheduling, minimizes off-chip data transfers and enhances core utilization by processing outputs through a stack of fused layers. Stream integrates fine-grained dependency modeling, memory- and communication-aware performance analysis, and a constraint-based optimization engine to deliver significant improvements in latency and energy efficiency. Stream's efficacy is validated on state-of-the-art architectures, achieving >95% accuracy in latency modeling for accelerators like DepFiN and DIANA. Comparative studies show up to 2.2× energy-delay product (EDP) improvements for activation-dominant workloads like MobileNetV2 under layer fusion. Additionally, scalability studies highlight Stream's adaptability to hardware configurations, demonstrating optimal core distributions as processing element budgets increase. All methodologies and results are open-source, enabling further innovation and adoption: https://github.com/kuleuven-micas/stream.
18:30 CEST PhDF.9 INVESTIGATING SECURITY ISSUES IN PROGRAMMABLE LOGIC CONTROLLERS AND RELATED PROTOCOLS
Speaker and Author:
Wael Alsabbagh, IHP – Leibniz Institute for High Performance Microelectronics, DE
Abstract
Programmable Logic Controllers (PLCs) play a substantial role in Critical Infrastructures (CIs) and Industrial Control Systems (ICSs). They are programmed with a control logic program that determines how to control and operate physical processes such as nuclear power plants, petrochemical factories, water treatment systems, and many others. Unfortunately, these devices are not fully secured and remain vulnerable to malicious attacks, particularly those targeting the control logic of PLCs. Such threats, known as control logic injection attacks, are designed to manipulate industrial processes, potentially causing catastrophic damages, as exemplified by the Stuxnet attack [1]. This thesis investigates various security issues and vulnerabilities associated with PLCs and their communication protocols, with a primary focus on control logic injection attacks. Our objective is to analyze the security mechanisms of both non-cryptographically and cryptographically protected PLCs, assessing the effectiveness of vendor-implemented safeguards. Siemens PLCs were selected for experimentation due to their widespread use in industrial environments. Figure 1 illustrates the methodology employed in this study, outlining our research workflow and experimental steps.
18:30 CEST PhDF.10 SECURE AND SCALABLE HARDWARE FOR POST-QUANTUM CRYPTOGRAPHY AND FULLY HOMOMORPHIC ENCRYPTION
Speaker and Author:
Aikata Aikata, TU Graz, AT
Abstract
Secure communication and privacy-preserving computation are the cornerstones of modern-day digital interactions. As the world becomes more interconnected, ensuring the confidentiality and integrity of both communication and computation is essential to safeguarding sensitive data and maintaining trust in digital systems. With the advent of quantum computing, traditional public key cryptographic schemes face obsolescence, making it crucial to develop new technologies that can uphold these pillars in a quantum-enabled future. Thus, this thesis focuses on secure communication and privacy-preserving computation through advancements in Post-Quantum Cryptography (PQC) and Fully Homomorphic Encryption (FHE). Efficient, compact, and secure hardware architectures for PQC are developed. Key contributions include the first unified hardware designs for NIST-standardized Digital Signature (CRYSTALS-Dilithium) and Key Encapsulation (CRYSTALS-Kyber), compact and resource-efficient implementations, agile designs that accommodate future algorithmic changes, and defence techniques against side-channel attacks (via masking). The research also addresses the challenges of cost-effective hardware acceleration for FHE to enable efficient privacy-preserving computation. In this direction, a major highlight is the pioneering scalable multi-chiplet architectures that achieve significant performance gains while reducing fabrication costs by 50%. The thesis also introduces the first hardware implementation of a Hybrid Homomorphic Encryption (HHE) scheme, Pasta, achieving a 97x speedup over existing solutions. Furthermore, new fault analysis techniques have been developed to emphasize the need for continued security research. The work further optimizes FHE applications for privacy-preserving neural network evaluations. To summarize, this thesis develops secure and scalable hardware solutions for advanced cryptographic techniques, PQC and FHE, which play a key role in the adoption of digital security and privacy in the post-quantum era. Several proposed works have also been open-sourced to foster further innovation in this domain.
18:30 CEST PhDF.11 LEARNING-BASED METHODS FOR ENABLING ON-EDGE, ACCURATE, SUSTAINABLE, AND HUMAN-CENTERED INTELLIGENT MANUFACTURING
Speaker:
Luigi Capogrosso, Università di Verona, IT
Authors:
Luigi Capogrosso, Marco Cristani and Franco Fummi, Università di Verona, IT
Abstract
Four major evolutions of industrialization have occurred throughout human history, impacting economic growth, population expansion, and significant social transformations. Industry 5.0 is regarded as the next industrial revolution, and its objective is to leverage the creativity of human experts in collaboration with efficient, accurate, and intelligent machines. In this context, the transformation of industrial resources into intelligent objects capable of sensing, acting, and adapting leads to intelligent manufacturing. To comprehensively enhance manufacturing systems capabilities, this thesis presents cutting-edge learning-based techniques around four key pillars of intelligent manufacturing: efficient edge-cloud computing, accurate anomaly detection, sustainability, and human-centered systems design. The results obtained are shown in Figure 1, which presents the real-world setup of the Industrial Computer Engineering (ICE) Laboratory of the University of Verona, where the presented contributions were tested and evaluated.
18:30 CEST PhDF.12 HIGH-DENSITY AND RELIABLE COMPUTE-IN-MEMORY CIRCUITS AND ARCHITECTURES FOR BIG DATA PROCESSING
Speaker:
Hongtao Zhong, Department of Electronic Engineering, Tsinghua University, CN
Authors:
Hongtao Zhong and Xueqing Li, Tsinghua University, CN
Abstract
Recent big data applications require processing massive data within limited time or power budgets, but the conventional Von Neumann architecture suffers from the "memory wall" challenge. Computing-in-Memory (CIM) technology is an emerging computing paradigm and is promising for overcoming this challenge. Besides, content addressable memory (CAM), originally used in caches and routing, has been shown to be capable of in-memory searching that can accelerate many search applications. Although existing CiM/CAM designs have shown significant energy efficiency improvements, low memory density and low reliability are still two big obstacles on the path towards "real computing/searching in memory". To address these challenges, we propose a series of cross-layer explorations from devices to architectures and applications, including the following three parts: i) high-density and low-power eDRAM memory and CiM circuits based on NEM relays and AFeFETs with low refresh overhead; ii) energy-efficient and reliable charge-domain CiMs and CAMs with high memory density thanks to the proposed dense cell and cluster design; iii) high-density 3D-memory-based domain-specific architecture (DSA) designs with a series of algorithm-architecture optimizations that achieve high end-to-end speedup and high energy efficiency. These works push the frontiers towards higher density, higher reliability, and more practical CiM/CAM designs.
18:30 CEST PhDF.13 HARDWARE RELIABILITY ASSESSMENT AND ENHANCEMENT FOR DEEP NEURAL NETWORKS
Speaker and Author:
Mohammad Hasan Ahmadilivani, Tallinn University of Technology, EE
Abstract
Due to the high capabilities of DNNs in solving various tasks, they are widely adopted in safety-critical applications such as automotive, space, and healthcare. A major concern in designing a system for such use cases is hardware reliability. To address hardware reliability concerns in DNN deployment, their fault resilience should first be assessed and then enhanced. With the growth of DNN exploitation, the size of emerging DNNs in terms of the number of parameters and computations is rapidly rising. This poses a huge complexity challenge for their reliability assessment and enhancement, necessitating efficient and innovative solutions to reduce complexity and overheads. In this thesis, some of the most significant challenges of reliability assessment and enhancement for DNNs are identified and addressed to enable the exploitation of DNNs in safety-critical applications. My thesis presents the first Systematic Literature Review (SLR) focused exclusively on methods of reliability assessment for DNNs, exploring these methods, classifying them, and identifying the existing gaps and challenges in the field. For reliability assessment, it addresses the scalability problem for the first time by introducing a novel semi-analytical and metric-oriented method. Moreover, this thesis introduces multiple cost-effective fault-tolerant techniques for DNNs, applicable to a wide range of DNN accelerators. Many methods in this thesis are open-source to enable researchers and engineers in this field to quickly evaluate DNNs' reliability and design fault-tolerant DNNs.
18:30 CEST PhDF.14 TOWARD RELIABLE AI ACCELERATORS
Presenter:
Eleonora Vacca, Politecnico di Torino, IT
Authors:
Eleonora Vacca and Luca Sterpone, Politecnico di Torino, IT
Abstract
Deploying deep neural networks (DNNs) in safety-critical systems, such as autonomous vehicles and medical diagnostics, demands high performance and reliability. Traditional approaches to enhance reliability, such as hardware redundancy, impose significant computational and energy overheads, making them unsuitable for practical use, especially in large-scale or resource-constrained systems. This research proposes a novel hardware-software co-design strategy to improve the reliability of Systolic Array (SA) accelerators, a key component for efficient DNN computation. The approach introduces error self-detection mechanisms that fully utilize the existing functional paths of the accelerator, eliminating the need for additional hardware. Furthermore, zero-overhead algorithmic techniques are developed to mitigate faults by leveraging insights into fault propagation and system behavior. These innovations enhance the fault tolerance of SA accelerators without increasing computational, memory, or energy costs, providing a scalable solution for reliable DNN performance in critical applications.
18:30 CEST PhDF.15 DESIGN AND SIMULATION OF ATOMIC-SCALE COMPUTING: BRIDGING COMPUTER SCIENCE, ELECTRICAL ENGINEERING, AND PHYSICS
Speaker:
Jan Drewniok, TU Munich, DE
Authors:
Jan Drewniok and Robert Wille, TU Munich, DE
Abstract
As AI and digital transformation accelerate, traditional computing architectures struggle to meet the soaring demand for energy efficiency. Silicon Dangling Bond (SiDB) technology stands out as a post-CMOS candidate, offering robust, scalable, and energy-efficient computing. However, despite its immense potential to revolutionize atomic-scale computing, the progress of SiDB technology has been hindered by a lack of interdisciplinary collaboration among computer scientists, electrical engineers, and physicists. This disconnect, caused by the absence of shared design rules and software tools to enforce interdisciplinary requirements, has limited the integration of hardware designs with computational strategies. To address these challenges, this thesis introduces a comprehensive framework for the SiDB technology that bridges these interdisciplinary gaps. Key contributions include the development of highly efficient physical simulators, achieving runtime improvements of up to a factor of 5000, and SiDB logic design algorithms with a runtime improvement of up to a factor of 63. Moreover, the thesis proposes the establishment of design rules (such as temperature behavior, defect analysis, and operational domain exploration) together with efficient algorithms to determine them for the first time for the SiDB technology. These advancements enable the automatic design of realistic and robust SiDB circuits, paving the way for real-world applications. In an effort to support open research and reproducibility, all aforementioned methodologies have been implemented into open-source tools and made publicly available on GitHub and PyPI. By bridging disciplines through this comprehensive framework, the thesis positions the SiDB technology as a viable and sustainable solution to address the escalating computational and energy demands of the future.
18:30 CEST PhDF.16 LEARNING-BASED ANALOG ICS LAYOUT AUTOMATION
Speaker and Author:
Davide Basso, University of Trieste, IT
Abstract
Analog integrated circuit layout has always been a challenging task, requiring sophisticated manual expertise to achieve optimal results. This thesis proposes a novel approach to streamline and accelerate this procedure by leveraging machine learning techniques. Specifically, we utilize a reinforcement learning agent to sequentially place devices on a chip canvas. The placement process is complemented by a Steiner tree-based global routing algorithm for driving connectivity. To enhance generalization capabilities, our pipeline uses a graph neural network, ensuring robust performance across various layout scenarios. This innovative approach is seamlessly integrated into Infineon's procedural layout generator, enabling users to maintain high-quality standards while significantly reducing manual effort. Experimental results demonstrate the efficiency of our method, reducing complete layout generation runtimes to 67.3% of those of traditional manual techniques.
18:30 CEST PhDF.17 PHYSICAL DESIGN FOR FIELD-COUPLED NANOCOMPUTING
Speaker:
Simon Hofmann, Chair for Design Automation, TU Munich, DE
Authors:
Simon Hofmann and Robert Wille, TU Munich, DE
Abstract
The growing demand for computational power, coupled with the limitations of Moore's Law and rising energy consumption of CMOS technologies, necessitates alternative computing paradigms. Field-coupled Nanocomputing (FCN) offers a promising solution by utilizing the repulsion of physical fields instead of electrical current for ultra-low-power computation at the nanoscale. Recent advances, such as sub-30 nm OR gates using Silicon Dangling Bonds (SiDBs), have renewed interest in FCN. However, constraints like planarity requirements, complex clocking schemes, and the need for signal synchronization pose significant challenges in physical design. This thesis addresses these challenges by developing novel physical design algorithms and tools to enhance the efficiency and scalability of FCN circuit design. We introduce NanoPlaceR, a reinforcement learning-based tool that reduces layout area by 50% compared to prior methods. Building upon this, we present gold, an algorithm that further reduces area overhead by 24% and accelerates the design process by 460 times. To enable cross-technology compatibility, we develop an algorithm that transforms layouts between Cartesian grids (used in Quantum-dot Cellular Automata) and hexagonal grids (required by SiDB gates), bridging different FCN technologies without extensive redevelopment. Furthermore, we propose post-layout optimization and wiring reduction techniques tailored to FCN, achieving additional area savings. We also introduce MNT Bench, a comprehensive benchmark suite providing gate-level layouts and network descriptions, and implement all methodologies in open-source tools within the Munich Nanotech Toolkit (MNT), promoting reproducibility and collaboration in FCN design automation. These contributions advance the state-of-the-art in FCN physical design, providing scalable and efficient solutions crucial for realizing FCN technologies in the post-CMOS era.
18:30 CEST PhDF.18 TOWARDS SOUND AND COMPLETE ANALYSIS OF INTEGRATED CIRCUITS AT TRANSISTOR-LEVEL
Speaker:
Oussama Oulkaid, Université Grenoble Alpes, FR
Authors:
Oussama Oulkaid1, Matthieu Moy2, Pascal Raymond3, Bruno Ferres4 and Mehdi Khosravian5
1University Lyon, EnsL, UCBL, CNRS, Inria, LIP, F-69342, LYON Cedex 07, France - University Grenoble Alpes, CNRS, Grenoble INP, VERIMAG, 38000 Grenoble, France - Aniah, 38000 Grenoble, France, FR; 2University Lyon, EnsL, UCBL, CNRS, Inria, LIP, F-69342, LYON Cedex 07, France, FR; 3University Grenoble Alpes, CNRS, Grenoble INP, VERIMAG, 38000 Grenoble, France, FR; 4University Grenoble Alpes, CNRS, Grenoble INP, VERIMAG, 38000 Grenoble, FR; 5Aniah, FR
Abstract
Circuit verification is an undoubtedly complex task. It is both costly and time-consuming: for Application Specific Integrated Circuit (ASIC) designs, verification accounts for a median of 50–60% of the total project time. In this work, we focus on a specific aspect of circuit verification, namely the verification of electrical properties at the transistor level. We present transistor-level semantics, and we show how they can be used in the context of electrical verification. We demonstrate the use of our approach for missing level-shifter detection, and we present prospects for extending the work to a form of reliability analysis.
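As an illustration of the kind of electrical check mentioned in the abstract, the following Python sketch flags nets whose driver and receiver sit in different voltage domains with no level shifter in between; the tiny netlist model and cell attributes are assumptions made for this example, not the thesis's transistor-level semantics.

cells = {
    "u_core": {"supply": "VDD_0V8", "type": "logic"},
    "u_ls":   {"supply": "VDD_1V2", "type": "level_shifter"},
    "u_io":   {"supply": "VDD_1V2", "type": "logic"},
}
nets = [                                     # (driver cell, receiver cell)
    ("u_core", "u_ls"),                      # crosses domains via a level shifter: ok
    ("u_ls",   "u_io"),                      # same domain: ok
    ("u_core", "u_io"),                      # crosses domains with no shifter: violation
]

def missing_level_shifters(cells, nets):
    violations = []
    for driver, receiver in nets:
        crosses = cells[driver]["supply"] != cells[receiver]["supply"]
        shifted = "level_shifter" in (cells[driver]["type"], cells[receiver]["type"])
        if crosses and not shifted:
            violations.append((driver, receiver))
    return violations

print(missing_level_shifters(cells, nets))   # -> [('u_core', 'u_io')]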
18:30 CEST PhDF.19 FULL-STACK SYSTEM DESIGN AND PROTOTYPING FOR PRACTICAL PHOTONIC-ELECTRONIC NEUROCOMPUTING
Presenter:
Yinyi Liu, The Hong Kong University of Science and Technology, HK
Author:
Yinyi Liu, The Hong Kong University of Science and Technology, HK
Abstract
The proliferation of more intelligent neural-network models continuously demands higher computing performance. Despite the superior processing speed and energy efficiency of integrated photonic circuits, their full potential remains far from being realized due to the lack of mature and comprehensive full-stack ecosystem support. Existing works on system-level design often ignore low-level details such as memory transactions or scheduling peripherals external to the photonic chip. As a result, there is currently no available toolchain that enables the seamless and convenient migration of design simulation results to physical implementation. In this study, we propose a comprehensive solution that covers both the software and hardware stacks to address this gap. Our toolchain includes an MLIR-based compiler that translates neural applications described in Python into an ELF executable file, using a customizable RISC-V ISA with photonic instructions specifically designed to run on photonic cores. The resulting executable can then be utilized in functional simulation or transferred to our reconfigurable hardware template for agile physical verification of the design. We anticipate that researchers and developers will utilize photonic-electronic neurocomputing more effectively in real-world applications by leveraging our proposed toolchain.
18:30 CEST PhDF.20 DYNAMIC MEMORY MANAGEMENT OPTIMIZATIONS OVER HETEROGENEOUS MEMORY SYSTEMS
Presenter:
Manolis Katsaragakis, National TU Athens, GR
Authors:
Manolis Katsaragakis1, Francky Catthoor2 and Dimitrios Soudris1
1National TU Athens, GR; 2IMEC, BE
Abstract
This PhD focuses on the development of a systematic methodology for source code organization, data structure refinement, exploration and placement over emerging memory technologies. The goal is to extract alternative solutions that provide multi-criteria trade-offs across different optimization aspects, such as memory footprint, accesses, performance and energy consumption.
18:30 CEST PhDF.21 SYSTEM-LEVEL DESIGN IN THE ERA OF BRAIN-COMPUTER INTERFACES
Presenter:
Guy Eichler, Columbia University, US
Authors:
Guy Eichler and Luca Carloni, Columbia University, US
Abstract
Brain-computer interfaces (BCIs) emerged in the 1960s. Since then, BCI applications have focused on enabling a better understanding of the brain and on providing the foundations for machine learning models, and ultimately they promise to provide a direct channel between the brain and the outside world. With advancements in the fields of Internet-of-Things (IoT) and machine learning (ML) at the edge, self-contained BCI systems that can acquire neural signals from the brain, communicate wirelessly, and execute computational kernels to process neural data have recently transitioned to the forefront of research and development. Major efforts are currently being made to evolve BCIs from non-invasive, low-resolution, wearable devices into invasive, high-resolution, implanted systems-on-chip (SoCs). However, unlike typical devices, implant-based, self-contained BCI systems must operate under strict safety requirements due to the sensitivity of brain tissue to heat. Consequently, to date, not a single self-contained, implant-based BCI system has been successfully tested in vivo (on live subjects), proven safe, and made available to the public. Thus, I state that constructing pragmatic BCI systems necessitates a holistic, system-level approach that emphasizes specialized hardware design while accounting for the brain as an integral component of the BCI system. I support this statement by contributing along three parallel BCI timelines, which I define as follows: 1) Pre-BCI Era - Our current time. Designing implant-based BCI systems that support large-scale neural data acquisition and wireless communication, and setting the groundwork for real-time computation within the BCI system. The goal is to reach the point where we have a functional and safe, self-contained BCI system. 2) Intra-BCI Era - The near future. Assuming that BCI systems are available, I develop a methodology to support the integration of BCI applications into the BCI system through hardware accelerator design, design-space exploration, and utilization of the brain as a resource in the system. I design hardware accelerators for BCI algorithms and for brain-based random-number generation to support BCI applications in the BCI system. Furthermore, we would like to support the scalability of the system. 3) Post-BCI Era - The not-so-far future. Assuming that we have BCI systems that integrate computation, I integrate biologically inspired computation on specialized SoCs to support better interfacing between ML and biological neurons. The goal is for BCI systems to ultimately function as cognitive co-processors for the brain.
18:30 CEST PhDF.22 ML-BASED RESOURCE MANAGEMENT OF RECONFIGURABLE SYSTEMS IN THE CLOUD-EDGE CONTINUUM
Speaker:
Juan Encinas, Universidad Politécnica de Madrid, ES
Authors:
Juan Encinas1, Alfonso Rodríguez1 and Andres Otero2
1UPM, ES; 2Universidad Politecnica de Madrid, ES
Abstract
Field-Programmable Gate Arrays (FPGAs) are commonly used in the embedded domain because they provide better energy efficiency than a Graphics Processing Unit (GPU) and competitive performance compared to an Application-Specific Integrated Circuit (ASIC). Moreover, Dynamic and Partial Reconfiguration can be used to modify part of the implemented logic at run time without interfering with the rest of the system, providing outstanding flexibility. Traditionally, in reconfigurable embedded systems, the applications to be accelerated and the relationships between them are known at design time and, therefore, a design space exploration process is typically performed to decide when to reconfigure each accelerator to maximize performance and reduce power consumption. However, there are other scenarios where workloads (i.e., the arrival order of a diverse set of accelerators) are not known at design time. This is the case of computing offloading scenarios where the FPGAs are placed in the cloud-edge continuum and work as acceleration engines. In these scenarios, FPGAs usually deal with dynamic workloads where requirements vary on demand, requiring run-time decisions to keep the optimal operating point, preserving the expected Quality of Service and power constraints. In order to make more informed decisions, the hardware accelerators must be characterized in terms of power consumption and execution performance. Doing this analytically is unfeasible due to the large number of variables involved in a real scenario where multiple accelerators are executed simultaneously. In this thesis, models based on ML techniques are proposed as a mechanism to predict power consumption and performance in reconfigurable multi-accelerator systems, since ML algorithms are particularly good at finding these complex relationships between multiple factors. Specifically, an incremental modeling approach has been implemented to characterize upcoming workloads at run time, updating the prediction models with new observations. A smart scheduler is also included to make resource management decisions based on the predictions of the incremental ML-based models. In addition, complementary infrastructures are also proposed for managing the dynamic workloads in FPGAs and monitoring the system, collecting the power consumption and performance traces used to train the models. Moreover, this solution has been designed following a microservice-based approach to enable the seamless deployment of hardware-accelerated functions to any platform across the continuum.
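The incremental-modeling idea can be sketched as follows in Python: a power predictor is updated at run time with each new batch of monitored observations instead of being retrained from scratch. The feature layout (three run-time monitor readings) and the synthetic data are illustrative assumptions, not the thesis's actual monitoring infrastructure.

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

model = SGDRegressor(learning_rate="constant", eta0=0.01)
scaler = StandardScaler()

def observe_batch(features, measured_power):
    # update the online scaler and the regressor with the latest run-time traces
    X = scaler.partial_fit(features).transform(features)
    model.partial_fit(X, measured_power)

rng = np.random.default_rng(0)
X0 = rng.uniform(size=(64, 3))                       # e.g. active accelerators, clock, occupancy
y0 = 2.0 * X0[:, 0] + 0.5 * X0[:, 1] + rng.normal(0, 0.01, 64)
observe_batch(X0, y0)                                # initial characterization

X1 = rng.uniform(size=(8, 3))                        # a later, previously unseen workload
observe_batch(X1, 2.0 * X1[:, 0] + 0.5 * X1[:, 1])   # refine the model, no retraining from scratch
print("predicted power:", model.predict(scaler.transform(X1[:2])))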
18:30 CEST PhDF.23 POWER, PERFORMANCE, AND THERMAL TRADE-OFFS IN MANYCORE ARCHITECTURES
Speaker and Author:
Gaurav Narang, Washington State University, US
Abstract
Non-Volatile Memory (NVM) based crossbars suffer from various non-idealities that affect the overall inferencing accuracy. To address that, the matrix-vector-multiplication operations are computed by activating a subset of the full crossbar, referred to as Operation Unit (OU). However, OU configurations (sizes) vary with the neural layers' features such as sparsity, kernel size, and their impact on predictive accuracy. We consider the problem of learning appropriate layer-wise OU configurations in ReRAM crossbars for unseen DNNs at runtime such that the performance is maximized without loss in predictive accuracy. We develop a machine learning (ML) based framework called Odin, which selects the OU sizes for different neural layers as a function of the neural layer features and time-dependent ReRAM conductance drift. Our experimental results demonstrate that the EDP is reduced by up to 8.7× over state-of-the-art homogeneous OU configurations without compromising predictive accuracy.
18:30 CEST PhDF.24 SOLVING COMBINATORIAL OPTIMIZATION PROBLEMS IN CAD WITH RRAM-BASED UNIVERSAL ISING MACHINE
Speaker:
Wenshuo Yue, Peking University, CN
Authors:
Wenshuo Yue and Bonan Yan, Peking University, CN
Abstract
Ising machines are annealing processors that leverage the physical dynamics of Ising graphs to address combinatorial optimization problems (COPs). Nevertheless, these machines are constrained to problems with specific graph topologies due to their inherent fixed spin configuration and connectivity. This thesis work explores hardware-software co-design approaches to develop a novel paradigm of hardware Ising machine, the universal Ising machine (UIM), enabled by resistive random-access memory (RRAM) and compute-in-memory (CIM) technology. It effectively accelerates the solving of COPs in computer-aided design (CAD). (1) This work designs and fabricates a multifunctional RRAM chip that integrates a content-addressable memory, compute-in-memory, and a random number generator in one chip. (2) This work proposes a novel paradigm, a universal Ising machine, that supports arbitrary Ising graph topology with adaptive low-cost hardware. The approach, interaction-centric storage, is suitable for any Ising graph and reduces the memory scaling cost. We experimentally implement the Ising machine on a 40nm RRAM CIM chip. (3) This work proposes a hardware-software co-design technique that, for the first time, maps a practical CAD problem onto the UIM. We use the UIM to solve max-cut and graph coloring problems, with the latter showing a 442–1450× improvement in speed and a 4.1e5–6.0e5× reduction in energy consumption compared to a GPU. When applied to a realistic CAD problem, multiple patterning lithography layout decomposition, the UIM achieves a 390–65,550× speedup compared to the ILP algorithm on a CPU.
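For readers unfamiliar with the mapping step, the following Python sketch casts a small MAX-CUT instance onto an Ising energy and anneals spin flips in software; it is a toy stand-in for the RRAM-based hardware described above, with the graph and cooling schedule chosen purely for illustration.

import math, random

edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0, (0, 2): 1.0}

def ising_energy(spins):
    # MAX-CUT maximizes sum_w (1 - s_i s_j) / 2, i.e. it minimizes H = sum_w s_i s_j
    return sum(w * spins[i] * spins[j] for (i, j), w in edges.items())

spins = {v: random.choice((-1, 1)) for v in range(4)}
T = 2.0
for _ in range(2000):
    v = random.randrange(4)
    before = ising_energy(spins)
    spins[v] *= -1                                   # propose a spin flip
    after = ising_energy(spins)
    if after > before and random.random() >= math.exp((before - after) / T):
        spins[v] *= -1                               # reject uphill moves, Metropolis style
    T *= 0.999                                       # cool down

cut = sum(w for (i, j), w in edges.items() if spins[i] != spins[j])
print("spins:", spins, "cut value:", cut)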
18:30 CEST PhDF.25 ACTIVE ROOT-OF-TRUST ARCHITECTURES FOR LOW-END EMBEDDED SYSTEMS
Presenter:
Youngil Kim, University of California, Irvine, US
Author:
Youngil Kim, University of California, Irvine, US
Abstract
This paper identifies key limitations in current RoT architectures for low-end IoT devices and introduces active RoTs: CASU, EILID, and TRAIN. These architectures are constructed via software/hardware co-design, incorporating minimal hardware modifications to support security features. We implement all RoTs on an open-source MSP430 core, a representative microcontroller for low-end IoT devices, and demonstrate their feasibility with real-world applications. Experimental results indicate that they achieve low runtime and hardware overhead.
18:30 CEST PhDF.26 SYSTEMATIC DESIGN AND EFFICIENT AUTOMATED IMPLEMENTATION OF LOGIC LOCKING
Speaker:
Akashdeep Saha, New York University Abu Dhabi, AE
Authors:
Akashdeep Saha1, Debdeep Mukhopadhyay2 and Rajat Subhra Chakraborty2
1New York University Abu Dhabi, AE; 2IIT Kharagpur, IN
Abstract
The escalating costs of IC fabrication have driven the adoption of fabless operations and a "horizontal" business model, emphasizing outsourcing across the IC supply chain. While this approach reduces production costs and accelerates time-to-market, it introduces vulnerabilities that result in billions of dollars in losses due to threats such as IP piracy [1], infringement, IC overproduction, and hardware Trojan insertions. Logic locking has emerged as a proactive defense mechanism, protecting designs by integrating key-based logic to thwart these potent supply chain threats. My PhD work explores advances in logic locking along three aspects. First, it identifies vulnerabilities in advanced state-of-the-art sequential and combinational locking techniques. We introduce novel attacks like ORACALL [7] and DIP Learning [5], compromising Cellular Automata (CA)-based FSM obfuscation and CAS-Lock, respectively. Secondly, it proposes countermeasures that enhance the non-linearity of CA structures [8] to mitigate attacks without affecting their crucial properties, and it leverages the security of cryptographic SPN-based block ciphers to design robust logic locking. Finally, we present MIDAS, an end-to-end CAD framework automating logic locking techniques across multiple paradigms. By unifying diverse approaches and leveraging graph-based analysis, MIDAS establishes a robust foundation for scalable, secure logic obfuscation.
18:30 CEST PhDF.27 AGING PHENOMENA IN DIGITAL CIRCUITS: CHARACTERIZATION, MITIGATION AND EXPLOITATION.
Speaker:
Andres Santana Andreo, Instituto de Microelectrónica de Sevilla, IMSE, CNM (CSIC, Universidad de Sevilla), ES
Authors:
Andres Santana Andreo1, Rafael Castro Lopez2, Elisenda Roca2 and Francisco Fernandez2
1Instituto de Microelectrónica de Sevilla, IMSE, CNM (CSIC, Universidad de Sevilla), ES; 2Instituto de Microelectronica de Sevilla, IMSE, CNM (CSIC, Universidad de Sevilla), ES
Abstract
The enormous benefits that CMOS technology scaling has brought have come along with an increase in variability. Not only Time-Zero Variability (TZV), which exists right after fabrication, but also Time-Dependent Variability (TDV) effects, like aging, are becoming more relevant and damaging and need to be considered during circuit design. Some examples of TDV phenomena are Bias Temperature Instability (BTI) or Hot Carrier Degradation (HCD). These phenomena show a stochastic nature, which makes them much harder to model. To address this issue, stochastic defect-centric models such as the Probabilistic Defect Occupancy (PDO) model are used. Parameter extraction for these models requires massive device characterization so that statistically significant information is obtained. Once these parameters are obtained, the model can be integrated into a simulation tool, and circuit reliability predictions can be made to prevent the impact of aging on the final design. Accuracy is critical, as overcompensation leads to unnecessary performance loss and undercompensation to early circuit failure. Specifically, in digital circuits, aging generally results in a longer propagation delay for logic gates, ultimately leading to potential timing violations. This thesis tackles the issue of TDV in digital circuits around three pillars: Characterization (by employing a novel chip design to characterize the aging degradation of individual logic gates), Modelling (by accurately modeling the circuit degradation under complex workloads with advanced compression techniques, introducing accurate guardbands into the design flow) and Exploitation (by employing the knowledge of TDV to produce reliable and cheap hardware security primitives).
18:30 CEST PhDF.28 DIGITAL TWINS IN AIRCRAFT: MERGING CYBER-PHYSICAL SYSTEM AND HUMAN DECISION-MAKING
Speaker:
Francesco Biondani, Università di Verona, IT
Authors:
Francesco Biondani and Franco Fummi, Università di Verona, IT
Abstract
The aviation industry is undergoing a profound digital transformation fueled by advancements in Artificial Intelligence (AI), the metaverse, and cybersecurity. At the forefront of this transformation are Digital Twins (DT), which hold immense potential for enhancing operational efficiency and safety. However, implementing Digital Twins on resource-constrained, in-service aircraft presents significant challenges. This research addresses these challenges from two complementary perspectives: Cyber-Physical Systems (CPS) and human-centered design. From the CPS perspective, a power-efficient digital twin framework has been developed and tailored specifically for predictive maintenance. Concurrently, the research leverages the metaverse to collect edge-case data and simulate human behavior in decision-making scenarios, bridging technological innovation with human factors to advance aviation safety and efficiency.
18:30 CEST PhDF.29 FAULTY BEHAVIORS SIMULATION IN INDUSTRIAL CYBER-PHYSICAL SYSTEMS FOR SAFETY ANALYSIS
Speaker and Author:
Francesco Tosoni, Università di Verona, IT
Abstract
Recently, industrial evolution has accelerated owing to the Industry 4.0 phenomenon. The Industrial Cyber-Physical Systems (ICPSs) that compose smart factories are increasingly complex and interconnected with each other and with humans. In such a context, functional safety is crucial for production, economic and legal reasons. Maintaining the correctness of the system functionality is achieved by monitoring the machine status and key parameters during its working phase. Virtual models and behavioral simulations are powerful tools for producing solid ICPSs and the safety measures required in such environments. Despite the complexity of creating these models, simulation is key to the design of not only the main system but also the surrounding production environment. In order to analyze the system's behavior, multi-domain behavioral fault taxonomies have been produced and tested in simulation on different case studies. Fault injection and simulation methodologies have been applied in the Verilog-AMS environment, as well as in Simulink and SystemC. In addition, an exploration of the potential of game engines as simulators of physical systems is ongoing, owing to their high accuracy in graphics rendering. The same fault models have also been useful for developing fault detection mechanisms based on Time-Sensitive Behavioral Contracts (TSBCs). Simulation models of the system under analysis enable the design and refinement of the contracts defined in the monitors. Future developments involve applying the same methodology to mixed-signal systems, thus including the system control part as well.
18:30 CEST PhDF.30 HIGH-PERFORMANCE AND FLEXIBLE HARDWARE ARCHITECTURES FOR FPGA-BASED SMARTNICS
Speaker:
Klajd Zyla, TU Munich, DE
Authors:
Klajd Zyla and Andreas Herkersdorf, TU Munich, DE
Abstract
In-network computing is a recent approach proposed by the research community to address the rise in computing demands associated with the significant growth of network traffic. This paradigm shift is increasing the number of tasks executed by network devices. As a result, processing demands are becoming more diverse, requiring flexible packet-processing architectures. State-of-the-art approaches provide a high degree of flexibility at the expense of performance for complex computations, or they ensure high performance but only for specific use cases. In my PhD thesis work, I proposed and developed high-performance and flexible hardware architectures tailored for FPGA-based SmartNICs, including a novel crossbar switch design and a novel NoC router design. I conducted experiments with synthetic and real-world network traffic to demonstrate their feasibility and advantages compared with state-of-the-art approaches. I focused on the following metrics: throughput, latency, and FPGA resource usage.
18:30 CEST PhDF.31 THE ACCELERATION OF GAUSSIAN BELIEF PROPAGATION USING RECONFIGURABLE HARDWARE
Presenter:
Omar Sharif, Imperial College London, GB
Author:
Omar Sharif, Imperial College London, GB
Abstract
Gaussian Belief Propagation (GBP) is an iterative method of performing probabilistic inference over factor graphs. Factor graphs, which represent relationships between variables and factors as bipartite structures, enable efficient statistical inference through message-passing algorithms. GBP is one such algorithm, which finds extensive application in domains such as simultaneous localization and mapping (SLAM) and image denoising, where approximate solutions to joint probability distributions are sufficient, making it a promising candidate for hardware acceleration in modern robotic systems. Despite its utility, GBP faces significant compute challenges when scaled to large graphs, especially in hardware-constrained environments. Our previous work during the PhD (featured at DATE 2024) presented a framework for designing scalable GBP processors using streaming architectures to process large graphs effectively. Our framework achieved remarkable improvements in performance efficiency (i.e., inference per watt), making it an extremely desirable solution for edge applications. However, scalability limitations remained. To address this, our current work (to be featured at DATE 2025) introduces a novel scheduler based on the information gain of message passes to prioritize node updates and thereby reduce wasted computations. By dynamically prioritizing nodes for update and double-buffering stream inputs, we achieve significant improvements in both processing and convergence rates for equal resources.
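A software sketch of the scheduling idea is given below: Gaussian belief propagation on a small pairwise model, where the node whose outgoing messages changed most is updated first, a residual-style proxy for the information-gain criterion described above. The 3-node precision matrix and the fixed iteration budget are illustrative assumptions.

import numpy as np

A = np.array([[3.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 3.0]])      # symmetric, diagonally dominant precision matrix
h = np.array([1.0, 2.0, 3.0])        # potential vector; the exact means are solve(A, h)
n = len(h)
nbrs = {i: [j for j in range(n) if j != i and A[i, j] != 0.0] for i in range(n)}
P = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}   # message precisions
m = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}   # message potentials

def send_from(i):
    change = 0.0
    for j in nbrs[i]:
        p = A[i, i] + sum(P[(k, i)] for k in nbrs[i] if k != j)
        q = h[i] + sum(m[(k, i)] for k in nbrs[i] if k != j)
        newP, newm = -A[i, j] ** 2 / p, -A[i, j] * q / p
        change += abs(newP - P[(i, j)]) + abs(newm - m[(i, j)])
        P[(i, j)], m[(i, j)] = newP, newm
    return change

residual = {i: float("inf") for i in range(n)}
for _ in range(30):
    i = max(residual, key=residual.get)          # update the most "informative" node first
    residual[i] = 0.0
    delta = send_from(i)
    for j in nbrs[i]:
        residual[j] += delta                     # neighbours now hold stale messages

means = [(h[i] + sum(m[(k, i)] for k in nbrs[i])) /
         (A[i, i] + sum(P[(k, i)] for k in nbrs[i])) for i in range(n)]
print("GBP means:", np.round(means, 4), "exact:", np.round(np.linalg.solve(A, h), 4))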
18:30 CEST PhDF.32 DOMAIN-SPECIFIC BENCHMARKS AND ARCHITECTURES FOR APPLICATIONS USING GRAPH-BASED DATA
Presenter:
Andrew McCrabb, University of Michigan, US
Authors:
Andrew McCrabb and Valeria Bertacco, University of Michigan, US
Abstract
Graph processing is foundational to modern applications like social networks, recommendation systems, and machine learning. In this dissertation, we identify that these applications span three distinct categories: graph-as-data-framework, graph-as-algorithmic-framework, and graph-as-both-frameworks, each presenting unique computational challenges. Graph-as-data-framework applications, such as PageRank, require improved memory bandwidth and data organization. Graph-as-algorithmic-framework applications, like Random Forests, demand more parallelism and bandwidth. Graph-as-both-framework applications, exemplified by Graph Neural Networks (GNNs), require a combination of all of the above. To improve the performance of these applications, this dissertation introduces three custom Processing-in-Memory (PIM) hardware accelerator designs, each tailored to one category of graph application. DREDGE addresses dynamic graph workloads through adaptive vertex partitioning to reduce communication overhead. ACRE accelerates tree-based ensemble learning while enabling explainable models. GLEAM optimizes GNN aggregation functions for enhanced efficiency and scalability. Finally, to support future advancements, this work presents the DyGraph and BeXAI benchmark suites for evaluating dynamic graph processing and explainable AI tasks, respectively. Together, these contributions help solve key challenges toward improving the performance of graph-based processing applications using both current and future memory technologies.
18:30 CEST PhDF.33 TOWARDS PERSONALIZED AI HEALTHCARE BACKED BY EMERGING TECHNOLOGIES
Speaker:
Ruiyang Qin, University of Notre Dame, US
Authors:
Ruiyang Qin and Yiyu Shi, University of Notre Dame, US
Abstract
In modern healthcare, personalized strategies that account for individual genetic, lifestyle, and medical histories are outperforming traditional, one-size-fits-all approaches. This shift is propelled by technological advancements, notably wearable edge devices equipped with conversational large language models (LLMs). However, these LLM-based solutions often fail to fully account for specific genetic and behavioral factors, relying predominantly on broad historical data from the general population, which limits their effectiveness in providing real-time, customized health advice and alerts. This limitation is exacerbated by the limited processing capabilities of current edge devices, impairing their responsiveness and reliability in critical scenarios such as suicide prevention or stroke detection. To address these challenges, I, in collaboration with a team of local psychologists and physicians from Indiana University School of Medicine, have developed a comprehensive cross-layer design from AI algorithms to emerging techniques like FeFET-based CiM architectures, for on-device AI personalization in healthcare applications. My thesis integrates four complementary components: Data Selection, RAG on CiM, Prompt Tuning via NVCiM, and Tiny-Align, each addressing distinct yet interconnected aspects of LLM-based personalization on resource-constrained edge devices. I have published the first three of them as the first author.
18:30 CEST PhDF.34 ENERGY-EFFICIENT MIXED-SIGNAL IN-SENSOR AND IN-MEMORY COMPUTING
Speaker:
Md Abdullah-Al Kaiser, University of Wisconsin - Madison, US
Authors:
Md Abdullah-Al Kaiser1 and Akhilesh Jaiswal2
1University of Wisconsin - Madison, US; 2U Wisconsin Madison, US
Abstract
Modern computing systems face two critical challenges: the cognitive wall and the memory wall. The cognitive wall represents the challenge faced by edge devices in Internet of Things (IoT) and artificial intelligence (AI) applications when trying to process large amounts of sensory data while operating with limited power and efficiency. The memory wall, on the other hand, highlights the growing performance gap between fast processors and slower memory access, leading to inefficiencies and increased energy consumption. These challenges arise from the conventional separation of sensors, memory, and processing units, which requires frequent data transfers between these components. This segmentation not only leads to higher energy consumption and processing delays but also impedes the efficiency of data transfer. To address these bottlenecks, there is a critical need for more integrated solutions that embed computation directly within sensors and memory. Hence, this research introduces two innovative solutions: (1) In-sensor computing through a hybrid CMOS+X architecture for neuromorphic vision sensors (NVS), combining CMOS transistors with magnetic domain-wall magnetic tunnel junctions (MDW-MTJs) for parallel, asynchronous, and energy-efficient computation at the pixel level. This approach reduces backend-processor energy consumption by 45.3%, while maintaining high accuracies of 97.82% on NMNIST, 79.17% on CIFAR10-DVS, and 95.99% on IBM DVS128-Gesture. (2) In-memory computing through a differential cross-coupled photonic SRAM (pSRAM)-augmented photonic tensor core for ultra-fast, low-energy matrix computations. The pSRAM achieves read/write speeds of 20 GHz with a switching energy of just 0.6 pJ, significantly improving matrix multiplication speed and efficiency. By embedding computation directly into sensors and memory, this work effectively addresses both the cognitive and memory walls, leading to significant energy savings and enhanced system performance. These integrated solutions offer a promising path forward for next-generation, energy-constrained, and data-intensive applications, particularly in fields such as IoT and AI.
18:30 CEST PhDF.35 PRINTED NEUROMORPHIC COMPUTING FOR ULTRA-RESOURCE-CONSTRAINED EDGE INTELLIGENCE
Speaker:
Priyanjana Pal, Karlsruhe Institute of Technology, DE
Authors:
Priyanjana Pal and Mehdi Tahoori, Karlsruhe Institute of Technology, DE
Abstract
With the evolution of next-generation electronics, expectations for fast-moving-consumer-goods (FMCG) electronics have grown significantly. In applications like on-skin electronics, such as smart band-aids, comfort and biocompatibility are key concerns, while in other areas, such as smart packaging and smart labels, the demand for ultra-low-cost, disposable electronics has become essential. Traditional silicon-based electronics, although they have evolved significantly in recent years, remain limited by their bulky substrates and complex manufacturing processes, making them unsuitable for these new demands. Printed Electronics (PE) has emerged as a promising alternative, using simple manufacturing techniques that deposit functional inks onto flexible substrates, reducing manufacturing costs and time and enabling features like nontoxicity, flexibility, and biodegradability. However, PE faces challenges due to its larger feature sizes and lower device counts, necessitating analog signal processing to bypass expensive ADC costs. Addressing the inherent challenges of variability, fault tolerance, and energy limitations in printed electronics requires robust design strategies to ensure reliability and performance. The main aim of this dissertation is to design and optimize printed neuromorphic circuits (pNCs) for robust, energy-efficient, and scalable applications in IoT, wearables, and edge computing. By addressing variability, fault tolerance, and manufacturing constraints, it leverages methods like Neural Architecture Search (NAS), energy-efficient computing, and adaptive mechanisms for temporal data processing to develop cost-effective, reliable, and bespoke pNCs for next-generation ultra-low-cost flexible electronics.
18:30 CEST PhDF.36 ENERGY-EFFICIENT ACCELERATORS FOR ML APPLICATIONS WITH IMPROVED RRAM DEVICE LIFETIME
Speaker:
Neethu K, School of Engineering, CUSAT, IN
Authors:
Neethu K1, Rekha James1 and Sumit Mandal2
1School of Engineering, Cochin University of Science and Technology, IN; 2Indian Institute of Science, IN
Abstract
Modern 2.5D systems built on in-memory computing (IMC)-based devices are well suited for DNN operations. However, they do not typically address the issue of large storage needs and high on-package as well as on-chip communication volume required during DNN training tasks. Different chiplets in the 2.5D system communicate with each other and with the storage device(s) using an interconnection mechanism called network-on-package (NoP). Studies show that the majority of the total communication energy is consumed by DRAM NoP communication. Hence, there is a need to construct an energy-efficient 2.5D system with IMC to perform DNN training. Moreover, the state-of-the-art IMC devices used for DNN accelerator design are RRAM-based. Owing to the low endurance of RRAM devices, only a limited number of weight updates can be performed while training different networks. To this end, we also propose an adaptive layer selection approach for DNN training to improve the lifetime of a 2.5D system with RRAM-based IMC devices.
18:30 CEST PhDF.37 IMPLEMENTATION AND EVALUATION OF DIFFERENT STRATEGIES OF COUNTERMEASURES TO PROTECT A RISC-V CORE AGAINST BOTH SOFTWARE AND PHYSICAL ATTACKS
Speaker:
William Pensec, Université Bretagne Sud, Lab-STICC, FR
Authors:
William Pensec1, Vianney Lapotre2 and Guy Gogniat3
1Université Bretagne Sud, UMR CNRS 6285, Lab-STICC, FR; 2University Bretagne Sud, UMR CNRS 6285, Lab-STICC, FR; 3Université Bretagne Sud, FR
Abstract
Nowadays, IoT devices face many threats. As these devices manipulate sensitive data, they need to be protected against both software and physical attacks. A solution against software attacks is to use a Dynamic Information Flow Tracking (DIFT) mechanism. DIFT techniques can detect various software attacks, such as memory overflows, SQL injections, etc., by attaching and propagating tags to information containers at runtime. A security policy determines the DIFT mechanism's behaviour. If a malicious behaviour is detected, an alert can be raised. Several implementations have been studied in the literature: hardware, software, and hybrid. Information containers differ depending on which type of DIFT is used; they range from files to registers. Hardware DIFT solutions can be grouped into two main categories: off-core and in-core. Off-core DIFT relies on a dedicated coprocessor to perform tag-related operations. This approach does not require internal processor modification and reduces the computation load on the main processor. In-core DIFT leads to significant, invasive modification of the processor. Tag-related operations are spread over the pipeline stages and are computed in parallel with the data computations. Compared to the off-core approach, it does not require specific communication and synchronisation management. In this work, we consider the D-RI5CY processor, which implements an in-core hardware DIFT, and we analyse its behaviour against Fault Injection Attacks (FIAs). FIAs can be performed by disturbing the power supply or the clock, or by using EM pulses or laser shots. Numerous studies have demonstrated the vulnerabilities of critical systems against FIAs; for example, glitch injections on the power supply have been used to manipulate the program counter (PC). These physical attacks effectively bypass protection mechanisms, allowing attackers to hijack the targeted system. Our objective is to develop effective countermeasures against FIAs to efficiently protect the D-RI5CY DIFT mechanism in order to obtain a system that is robust against both software and physical attacks.
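The tag-propagation principle behind DIFT can be sketched in a few lines of Python: one-bit tags attached to registers propagate through instructions, and the policy raises an alert when tainted data reaches a sensitive sink such as an indirect jump target. The toy instruction set below is an illustrative assumption, not the D-RI5CY implementation.

regs = {f"x{i}": 0 for i in range(4)}
tags = {r: 0 for r in regs}                   # 1 = value derived from untrusted input

def execute(instr):
    op, *args = instr
    if op == "load_input":                    # untrusted data enters the system
        rd, value = args
        regs[rd], tags[rd] = value, 1
    elif op == "add":
        rd, rs1, rs2 = args
        regs[rd] = regs[rs1] + regs[rs2]
        tags[rd] = tags[rs1] | tags[rs2]      # propagation policy: OR of source tags
    elif op == "jalr":                        # indirect jump target: a sensitive sink
        (rs,) = args
        if tags[rs]:
            raise RuntimeError(f"DIFT alert: jump target in {rs} is tainted")

program = [
    ("load_input", "x1", 0x40),               # attacker-controlled value
    ("add", "x2", "x1", "x0"),                # taint propagates from x1 to x2
    ("jalr", "x2"),                           # the policy raises an alert here
]
try:
    for instr in program:
        execute(instr)
except RuntimeError as alert:
    print(alert)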
18:30 CEST PhDF.38 SUPPORTING END USERS IN IMPLEMENTING QUANTUM COMPUTING APPLICATIONS
Speaker:
Nils Quetschlich, TU Munich, DE
Authors:
Nils Quetschlich and Robert Wille, TU Munich, DE
Abstract
Quantum computing has made tremendous improvements in both software and hardware that have sparked interest in academia and industry to realize quantum computing applications. To this end, several steps are necessary: choosing a suitable quantum algorithm, encoding it into a quantum circuit, selecting a suitable device, compiling the circuit accordingly, executing it, and finally decoding the result. These steps are rather tedious and error-prone and thus create a high entry barrier for end users with limited quantum computing expertise who need solutions to domain-specific problems. This situation is worsened even further, since bad choices in the described steps can lead to pure noise and, in the worst case, no usable solution.
18:30 CEST PhDF.39 FAULT-TOLERANT CNN ACCELERATOR WITH RECONFIGURABLE CAPABILITIES
Speaker:
Rizwan Tariq Syed, IHP GmbH - Leibniz Institute for High Performance Microelectronics, DE
Authors:
Rizwan Tariq Syed and Milos Krstic, Leibniz-Institut für innovative Mikroelektronik, DE
Abstract
Mapping AI models onto hardware faces significant challenges due to high computation and energy requirements. Continuously varying AI requirements and workloads add to the existing challenge and cause hardware resource utilization to quickly reach its limits. These challenges grow further for safety-critical applications, which require high reliability standards. Thus, there is a need for efficient ways of implementing AI models on hardware that, along with high reliability, can reconfigure itself to fulfill varying application requirements. This research work focuses on the CNN model and, with the aim of addressing the above-mentioned challenges, this thesis presents: (1) a shared-layers methodology to efficiently map CNN models on hardware, (2) a fault-tolerant CNN accelerator with reconfigurable capabilities based on the shared-layers methodology, and (3) the integration of a multi-purpose on-chip sensor into the fault-tolerant reconfigurable CNN accelerator. The results obtained in this research work aim to establish a foundation for the development of fully reconfigurable, resilient AI processing systems, thereby addressing the reliability, performance, and energy consumption challenges faced by the computational hardware.
18:30 CEST PhDF.40 CONQUERING TIMING UNPREDICTABILITY IN HIGH-LEVEL SYNTHESIS
Speaker:
Carmine Rizzi, ETH Zurich, CH
Authors:
Carmine Rizzi and Lana Josipovic, ETH Zurich, CH
Abstract
Designing hardware is a complex and time-consuming task that requires specialized expertise. High-Level Synthesis (HLS) tools have revolutionized this process by streamlining digital hardware design and making it more accessible. These tools start from high-level programming languages such as C/C++ and produce Register-Transfer Level (RTL) code to design circuits (e.g., FPGA or ASIC). However, despite their potential, there remains a significant gap in the quality of circuits designed by experienced hardware engineers compared to those generated by HLS tools. This is mainly due to the inability of HLS to account for the effect of lower-level hardware implementation steps. One consequence of this qualitative disparity is the unpredictability of the operating frequency in circuits generated by HLS tools. The main goal of this thesis is to reduce this gap and the discrepancies between the HLS tool timing model and the final circuit's frequency. This represents a fundamental step in producing circuits with HLS that can achieve high and reliable operating frequency.
18:30 CEST PhDF.41 DEEP LEARNING MODELS OPTIMIZATIONS FOR REAL-TIME INTELLIGENT VIDEO ANALYTICS
Speaker:
Michele Boldo, Università di Verona, IT
Authors:
Michele Boldo and Nicola Bombieri, Università di Verona, IT
Abstract
Real-time video analytics is becoming increasingly important in several domains, including healthcare and Industry 5.0. Edge-based processing is emerging as a solution to reduce latency, safeguard privacy, and effectively manage bandwidth constraints. Although Deep Learning (DL) models are very effective, their substantial computational requirements pose significant challenges for implementation on low-power edge devices. My thesis introduces two main methodologies to face these challenges. The first one is based on Collaborative Deep Inference. This approach mitigates accuracy degradation by partitioning the DL model across multiple devices. The model is dynamically split between the edge device and a server, with the division point selected based on latency constraints and the computational and transmission conditions. Data quantization and compression techniques are employed to minimize the impact on accuracy while optimizing performance. The second one is based on Online Domain Adaptation. This methodology focuses on adapting pre-trained DL models to specific deployment scenarios, particularly when real-world data deviate from the training data. Knowledge Distillation is employed to obtain labels at runtime, where a larger, well-trained "teacher" model transfers its knowledge to a smaller, lightweight "student" model. To determine when the "student" model requires retraining, an algorithm based on Singular Value Decomposition (SVD) is used to monitor prediction quality over time without relying on external labels. The results demonstrate that both methodologies achieve high accuracy while significantly reducing energy consumption and enhancing frame rates.
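One way to picture a label-free retraining trigger of this kind is the following Python sketch: SVD builds a reference subspace from trusted feature batches, and new batches whose energy falls largely outside that subspace are flagged for retraining. The threshold, dimensions, and synthetic data are illustrative assumptions, not the thesis's exact monitoring algorithm.

import numpy as np

def principal_subspace(reference, k=3):
    # right singular vectors of the reference window span its dominant subspace
    _, _, vt = np.linalg.svd(reference, full_matrices=False)
    return vt[:k].T                                   # shape (d, k)

def subspace_energy(batch, basis):
    projected = batch @ basis                         # coordinates inside the subspace
    return np.linalg.norm(projected) ** 2 / max(np.linalg.norm(batch) ** 2, 1e-12)

rng = np.random.default_rng(1)
scale = np.diag([5, 4, 3, 1, .1, .1, .1, .1])
reference = rng.normal(size=(200, 8)) @ scale         # features from a trusted period
basis = principal_subspace(reference)

in_domain = rng.normal(size=(32, 8)) @ scale
shifted = rng.normal(size=(32, 8))                    # drifted distribution
for name, batch in [("in-domain", in_domain), ("shifted", shifted)]:
    energy = subspace_energy(batch, basis)
    print(name, round(energy, 3), "-> retrain student" if energy < 0.8 else "-> ok")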
18:30 CEST PhDF.42 COLLECTIVE METHODOLOGIES FOR EFFICIENT HIGH-LEVEL SYNTHESIS
Speaker:
Aggelos Ferikoglou, National TU Athens, GR
Authors:
Aggelos Ferikoglou1, Sotirios Xydis1 and Dimitrios Soudris2
1National TU Athens, GR; 2National Technical University of Athens, GR
Abstract
This PhD thesis is dedicated to developing methodologies that empower users to effectively utilize High-Level Synthesis (HLS) for Field Programmable Gate Arrays (FPGAs). The primary aim is to simplify the complex and time-intensive process of understanding hardware concepts, making HLS accessible to those without prior expertise in the field. The research emphasizes democratizing HLS by offering good starting points for design optimization and ready-to-use tools that enable designers to produce high-quality results.
18:30 CEST PhDF.43 OPTIMIZATION OF A REMOTE MONITORING PLATFORM FOR EDGE DEVICES
Presenter:
Mirco De Marchi, Università di Verona, IT
Authors:
Mirco De Marchi and Nicola Bombieri, Università di Verona, IT
Abstract
Healthcare technologies have witnessed significant advancements, especially with the need for remote monitoring platforms for safety, training and analysis of human behavior. This study presents the implementation and optimization of a real-time platform for remote motion analysis using Inertial Measurement Unit (IMU) and camera sensors. We claim that more accurate results can be achieved by mixing Human Pose Estimation (HPE) techniques with information collected by wearables. Specifically, we introduce a matching model that fuses HPE and IMU data to compensate for the inaccuracies of low-cost sensors and inaccurate models. Despite this, the presence of multiple deployed models on a resource-constrained device leads to performance degradation. Model compression techniques prove effective at reducing the models' computational load while maintaining good accuracy. We design a novel pruning framework for convolutional neural network (CNN) models tailored for edge devices that ensures optimized inference across multiple performance metrics, including accuracy, latency, and energy consumption. The results indicate its effectiveness in balancing model complexity and performance of motion analysis applications in edge devices.
18:30 CEST PhDF.44 A DESIGN SPACE EXPLORATION FRAMEWORK FOR DNN COMPRESSION USING LOW RANK FACTORIZATION
Speaker:
Milad Kokhazadeh, School of Informatics, Aristotle University of Thessaloniki, GR
Authors:
Milad Kokhazadeh1, Georgios Keramidas2 and Vasilios Kelefouras3
1PhD Candidate, Aristotle University of Thessaloniki, GR; 2Aristotle University of Thessaloniki/Think Silicon S.A., GR; 3University of Plymouth, GB
Abstract
Deep neural networks (DNNs) deliver state-of-the-art performance across various applications but are highly computationally demanding, restricting their deployment on resource-limited edge devices. Low-rank factorization (LRF) is a promising technique to reduce complexity and memory footprint of DNNs while maintaining performance. However, challenges remain in optimizing rank selection, balancing memory-accuracy trade-offs, and integrating LRF into training. To address these challenges, we propose two methodologies: a design space exploration (DSE) framework for optimizing LRF configurations and a feature-map similarity-based strategy for compressing convolutional layers. Our approach automates rank selection and dynamically adjusts compression ratios, achieving over 90% parameter reduction while preserving accuracy, enabling efficient DNN deployment on resource-limited platforms.
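The core low-rank factorization step, and the memory/accuracy trade-off that a design space exploration of this kind sweeps over, can be sketched as follows in Python; the layer size and the rank grid are illustrative assumptions, not the thesis's actual configurations.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 256))                       # weights of one fully connected layer
U, S, Vt = np.linalg.svd(W, full_matrices=False)

for rank in (4, 16, 64, 128):
    A = U[:, :rank] * S[:rank]                        # first factor, shape (512, rank)
    B = Vt[:rank]                                     # second factor, shape (rank, 256)
    params = A.size + B.size
    rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
    ratio = params / W.size
    print(f"rank={rank:3d}  params={ratio:5.1%} of original  rel. error={rel_err:.3f}")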
18:30 CEST PhDF.45 POWER-EFFICIENT APPROXIMATE 4:2 COMPRESSORS FOR IMAGE MULTIPLICATION AND NEURAL NETWORKS
Speaker:
Vinicius Zanandrea, Federal University of Santa Catarina, BR
Authors:
Vinicius Zanandrea and Cristina Meinhardt, Federal University of Santa Catarina, BR
Abstract
This work proposes two approximate 4:2 compressors, MAX4:2CV1 and MAX4:2CV2, targeting power efficiency and area optimization. We demonstrate the advantages of employing the proposed 4:2 compressors for partial product reduction in Dadda tree multipliers. Also, we compare the performance of our proposed circuits with seven approximate 4:2 compressors from the literature and with an exact compressor. The MAX4:2CV2-based multiplier achieved Peak Signal-to-Noise Ratio (PSNR) values of 31 dB on average for pixel-wise image multiplication, indicating acceptable quality results for error-tolerant applications. This proposal reduces delay by up to 50.4%, power consumption by up to 59.2%, and Power-Delay Product (PDP) by up to 79.7% compared to an exact multiplier. Experiments with two datasets demonstrated that using MAX4:2CV1 in approximate multipliers for neural networks can result in comparable accuracy to exact multipliers while reducing power consumption by up to 56%. The set of information provided in this PhD work supports designers in choosing the best approximate multiplier according to the design requirements.
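To make the evaluation procedure concrete, the Python sketch below enumerates all 16 input patterns of a 4-input 4:2 compressor, compares the two-bit output value against the exact bit count, and reports the usual error metrics; the approximate logic shown is a simple illustrative choice, not the MAX4:2CV1/CV2 designs themselves.

from itertools import product

def exact_count(x1, x2, x3, x4):
    return x1 + x2 + x3 + x4                  # what sum + 2*(carry + cout) would encode exactly

def approx_compressor(x1, x2, x3, x4):
    carry = (x1 & x2) | (x3 & x4)
    s = (x1 ^ x2) | (x3 ^ x4)
    return s, carry                           # approximate value = s + 2*carry

errors = []
for bits in product((0, 1), repeat=4):
    s, c = approx_compressor(*bits)
    errors.append((s + 2 * c) - exact_count(*bits))

error_rate = sum(e != 0 for e in errors) / len(errors)
mean_error_distance = sum(abs(e) for e in errors) / len(errors)
print(f"error rate = {error_rate:.3f}, mean error distance = {mean_error_distance:.3f}")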
18:30 CEST PhDF.46 HARDWARE CNN ACCELERATOR DESIGNS CONFIGURED WITH STATISTICALLY ERROR VARIANT APPROXIMATE MULTIPLIERS
Speaker:
Bindu G Gowda, International Institute of Information Technology Bangalore, IN
Authors:
Bindu G Gowda1 and Madhav Rao2
1International Institute of Information Technology, Bangalore, IN; 2International Institute of Information Technology-Bangalore, IN
Abstract
Convolutional Neural Networks (CNNs) are renowned for their exceptional feature extraction capabilities, making them a cornerstone in various applications. However, implementing CNNs in hardware poses challenges due to extensive computational requirements, especially in multipliers, which are the most power-intensive and latency-prone units. Approximate computing techniques have gained attention for their potential to reduce power consumption, enhance performance, and improve space efficiency. Despite the widespread intention to apply approximate computing to AI workloads, the hardware benefits have in the past not been fully realized without compromising network accuracy. This research work commenced with the design of novel error-balanced approximate multipliers (AMs), introducing approximation at the partial product reduction stage of the multiplication process using approximate compressors (ACs). Two categories of AC designs were proposed, considering the statistical mean error and the direction of the error distribution, and 8 distinct configurations of AMs were constructed by strategically positioning these ACs over the generated partial products and in the successive reduction stages, to achieve error-balanced designs with favorable error metrics. This research work then introduces approximate multipliers along the convolutional layers of the CNN and thus presents a unique framework for designing hardware-efficient and error-resilient on-chip design solutions for accelerating Machine Learning workloads. Adopting AMs relaxes hardware demands but suffers from a drop in network accuracy, and hence choosing AMs becomes pivotal. Leveraging a precise combination of multipliers along the convolutional layers, instead of uniform multipliers throughout the network, was found to enhance network performance. Considering that the exhaustive approach is a highly laborious task, it was important to explore the use of optimization algorithms to arrive at the optimal solution. Single- and multi-objective algorithms were further exploited to identify the Pareto-optimal solutions comprising AM sequences that balance hardware parameters and CNN accuracy loss.
18:30 CEST PhDF.47 SECURING THE TEST INFRASTRUCTURE OF SOCS
Speaker:
Anjum Riaz, IIT Jammu, IN
Authors:
Anjum Riaz and Satyadev Ahlawat, IIT Jammu, IN
Abstract
The IEEE Standard 1687 (IJTAG) has become a widely adopted framework for efficient access to on-chip instruments, enabling functionalities like testing, diagnostics, post-silicon validation, and system health monitoring throughout the lifecycle of System-on-Chip (SoC) devices. However, its lack of integrated security mechanisms exposes the scan network to potential side-channel and malicious instrument attacks, posing risks such as data sniffing, alteration, IP theft, and reverse engineering. Existing solutions leveraging user authorization, cryptographic methods, and secure protocols partially address these vulnerabilities but often fail to scale efficiently or preserve IJTAG's functional flexibility. This thesis proposes a secure extension to the IJTAG standard by introducing the Inherently Secure SIB (ISSIB), designed to safeguard the IJTAG network from unauthorized access while maintaining its dynamic reconfigurability. The ISSIB achieves robust security with minimal area overhead (1.11% compared to standard SIB). Additionally, the scope of ISSIB is extended to secure high-speed Streaming Scan Networks (SSN) with an area overhead of only 1.91%, significantly lower than alternative solutions. Further enhancements include a topology to mitigate data sniffing and alteration threats by ensuring direct data paths between test instruments, avoiding interference by malicious components. Lastly, this study explores leveraging functional ports (e.g., UART) as secure alternatives to Test Access Ports (TAP), reducing access time and data overhead by up to 45.51% and 69.66%, respectively, while maintaining encryption-based security. These advancements address critical IJTAG vulnerabilities, enabling secure and efficient operation across resource-constrained and high-performance SoC environments.
18:30 CEST PhDF.48 OPEN-SOURCE DESIGN OF A LOW-POWER SNN HARDWARE ACCELERATOR FOR EDGE AI
Presenter:
Luca Martis, Università degli studi di Cagliari, IT
Authors:
Luca Martis1 and Paolo Meloni2
1Università degli studi di Cagliari, IT; 2Università degli Studi di Cagliari, IT
Abstract
Edge computing brings data processing closer to the source that generates the data, offering benefits such as reduced latency, lower bandwidth usage, and increased system reliability. Implementing artificial intelligence (AI) algorithms at the edge is essential for creating intelligent sensors but poses challenges due to the energy and computational limitations of edge devices. To address these challenges, Spiking Neural Networks (SNNs) have gained attention as a promising AI solution for edge applications due to their energy efficiency. Neuromorphic processors, optimized for the sparse, event-driven nature of SNNs, offer significant energy savings and faster response times. However, the adoption of neuromorphic processors remains constrained by their high costs and the challenges of integrating them with existing edge devices. The objective of this thesis is to develop a hardware accelerator for SNNs tailored to on-edge applications, prioritizing low power consumption and real-time operation. The accelerator's layout will be implemented using open-source Electronic Design Automation (EDA) tools to minimize costs and overcome traditional barriers to hardware innovation, enabling accessible and efficient solutions for edge AI systems.
18:30 CEST PhDF.49 A PROPOSED EDA FLOW FOR ITERATIVE HARDWARE/RESILIENCE CO-DESIGN
Speaker:
Peer Adelt, University of Applied Sciences Hamm/Lippstadt, Germany, DE
Authors:
Peer Adelt1 and Achim Rettberg2
1Hamm-Lippstadt University of Applied Sciences, DE; 2Carl von Ossietzky University Oldenburg, DE
Abstract
The classical HW/SW co-design flow begins with requirement elaboration, where system objectives and constraints are analysed to define functional and non-functional requirements. A high-level system specification is then created, outlining desired functionality and performance targets. Next, the application is partitioned into hardware and software components, guided by factors such as performance, cost, energy efficiency, and development complexity to leverage the strengths of each domain. Building on this flow, the proposed extension focuses on fault detection and resilience. It introduces fault modelling, resilience-oriented partitioning, and robust testing strategies like fault injection and resilience-aware co-simulation. The proposed methodology, demonstrated with several applications for the 32-bit Freedom-E RISC-V platform, is available as open-source software on GitHub under https://github.com/hshl-hmit/fear-v.
18:30 CEST PhDF.50 DESIGN AND APPLICATIONS OF SIMULATED BIFURCATION ISING MACHINES
Presenter:
Tingting Zhang, McGill University, CA
Authors:
Tingting Zhang1 and Jie Han2
1McGill University, CA; 2University of Alberta, CA
Abstract
Ising machines have received growing interest as efficient and hardware-friendly solvers for combinatorial optimization (CO) problems. They search for the absolute or approximate ground states of the Ising model. A simulated bifurcation (SB) Ising machine searches for the solution by solving pairs of differential equations related to the oscillator positions and momenta. It benefits from massive parallelism but suffers from relatively high hardware costs. To enhance efficiency while maintaining high-quality solutions for CO problems, this project attempts to use quantization schemes, stochastic computing-based integrators, and approximate multipliers in SB machines. As example applications, the traveling salesman problem (TSP) and the multi-input multi-output (MIMO) detection problem are explored. Quantized SB machines (QSBMs) use innovative quantization methods to replace costly multiplication operations with simpler ones. Ternary algorithms dynamically simplify calculations, and advanced multi-value approaches improve numerical precision. Implemented on an FPGA, a QSBM with 2048 spins reduces hardware usage by over 50% and delivers 98.5% of the best-known solution in just 0.73 ms. Dynamic stochastic computing offers efficient accumulation operations. Stochastic SB machines (SSBMs) use signed stochastic integrators (SSIs) for numerical integration, achieving significant area reductions. Two SB cell types improve efficiency: one focuses on area savings, and the other on reducing delays. The SSBM demonstrates a significant area reduction of at least 10.62% compared to the latest designs. Floating-point (FP) representations enable accurate CO solutions but demand more hardware. This work proposes hardware-efficient logarithmic FP multipliers, achieving quality solutions with reduced costs. In routing and scheduling, the TSP is mapped to an Ising problem, using dynamic time steps and redundant spins to enhance solution quality and runtime. For MIMO systems, the SB-based detector applies regularization and dropout strategies, achieving lower error rates than traditional methods.
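The ballistic simulated-bifurcation dynamics can be sketched as follows in Python: pairs of position and momentum variables evolve under a ramped bifurcation term plus the Ising coupling force, with inelastic walls at |x| = 1. The graph, time step, and constants are illustrative assumptions, not the QSBM/SSBM hardware parameters.

import numpy as np

w = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)     # MAX-CUT edge weights
J = -w                                        # couplings of the equivalent Ising model
n = len(J)

rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(n)             # oscillator positions
y = np.zeros(n)                               # oscillator momenta
a0, c0, dt, steps = 1.0, 0.5, 0.05, 400

for t in range(steps):
    a = a0 * t / steps                        # bifurcation parameter ramps up over time
    y += (-(a0 - a) * x + c0 * (J @ x)) * dt  # momentum update with the coupling force
    x += a0 * y * dt                          # position update
    wall = np.abs(x) > 1.0                    # inelastic walls keep |x| <= 1
    x[wall] = np.sign(x[wall])
    y[wall] = 0.0

spins = np.sign(x)
cut = sum(w[i, j] for i in range(n) for j in range(i + 1, n) if spins[i] != spins[j])
print("spins:", spins.astype(int), "cut value:", cut)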
18:30 CEST PhDF.51 COMPUTATION-IN-MEMORY BASED EDGE-AI FOR HEALTHCARE: A CROSS-LAYER APPROACH
Speaker:
Sumit Shaligram Diware, Computer Engineering Lab, Delft University of Technology, The Netherlands, NL
Authors:
Sumit Diware and Rajendra Bishnoi, TU Delft, NL
Abstract
Edge computing for AI (edge-AI) combines data sources with local AI processing hardware, to provide low response latency, alleviate network costs, enhance data privacy/security, and improve service reliability. Computation-in-memory (CIM) presents a promising alternative to conventional hardware for designing energy-efficient and compact edge-AI hardware. It achieves this through in-situ data processing using emerging memory technologies called memristors. CIM-based edge-AI hardware holds significant potential for AI-based healthcare, where it can greatly enhance human well-being through fast, reliable, and secure processing of medical data. Designing CIM edge-AI hardware for healthcare is a two-phase process spanning six abstraction layers. The first phase involves creating a customized neural network model for the healthcare task and covers the first three abstraction layers (application, algorithm, and optimization). The challenge here is to achieve strong and effective algorithmic performance, while tailoring the model to maximize CIM hardware benefits. The second phase focuses on translating model computations into CIM hardware operations and spans the remaining three abstraction layers (mapping, micro-architecture and circuits, device). Mitigating memristor non-idealities, which introduce computational errors in CIM operations, becomes the primary challenge in this phase. Moreover, it is crucial to integrate the model and mitigation techniques into a holistic solution as a chip prototype. This thesis addresses these challenges using a cross-layer approach. We first create effective and energy-efficient models for cardiac arrhythmia classification and diabetic retinopathy screening, through contributions across the first three abstraction layers. To translate the models onto CIM hardware without accuracy loss, we identify the key memristor non-idealities and devise mitigation strategies against them by contributing across the remaining three abstraction layers. Lastly, we integrate our arrhythmia classification model and non-ideality mitigation strategies into a chip prototype. Thus, our work covers the full abstraction layer stack and paves the way for enhanced AI-based healthcare.
18:30 CEST PhDF.52 OPTIMIZING LEARNING THROUGH CO-DESIGN IN NEUROMORPHIC COMPUTING
Speaker:
Lakshmi Varshika Mirtinti, Drexel University, US
Authors:
Lakshmi Varshika M and Anup Das, Drexel University, US
Abstract
Deep convolutional neural networks (CNNs) have traditionally relied on GPUs for training and inference, but these platforms face inherent limitations, including memory bandwidth constraints, data access bottlenecks, and high energy consumption, making them unsuitable for edge and real-time applications. Neuromorphic systems, inspired by spiking neural networks (SNNs), offer a transformative alternative by mimicking biological neural systems to achieve superior computational efficiency and significantly lower power consumption. A pivotal innovation in neuromorphic computing is on-chip learning, enabling continuous adaptation to streaming data in real time, akin to human-like dynamic learning. This adaptability allows refinement of decision-making in response to evolving data, overcoming the static and task-specific nature of traditional AI systems. Unsupervised learning plays a critical role in this paradigm, enabling pattern and feature extraction from unstructured, unlabeled data prevalent in real-world applications. Instead of treating hardware and software as independent, separate stages in the development process, co-design aligns them from the start, optimizing the interaction between the two. This approach is particularly important for complex, performance-critical systems. This thesis proposes a design methodology for efficient on-chip training of unsupervised applications on hardware. It introduces (1) an Online Learning Unit (OLU) to address hardware challenges for selective unsupervised learning on neuromorphic platforms and (2) a co-design framework that maps spiking CNN applications to diverse core architectures [7]. The following sections outline these contributions and their implications for advancing neuromorphic computing in practical AI solutions.
18:30 CEST PhDF.53 FAULT-TOLERANT TECHNIQUES FOR EMERGING NON-VOLATILE MEMORIES AND NEUROMORPHIC COMPUTING SYSTEMS
Speaker and Author:
Surendra Hemaram, Karlsruhe Institute of Technology, DE
Abstract
The need for high performance and low power consumption in modern computing systems has led to aggressive technology scaling, increasingly limiting the potential of CMOS memory technologies. Emerging non-volatile memories (NVMs) have revolutionized data storage, making them viable alternatives to CMOS memories. In the context of a standalone memory, among other NVMs, spin-transfer torque magnetic random access memory (STT-MRAM) is the most promising candidate, as shown by several industrial demonstrations. However, it has some reliability issues, including soft and hard errors. Addressing these failure mechanisms in STT-MRAM is essential to improve reliability and manufacturing yield. Building on the advancements of emerging NVMs, neuromorphic computing systems have also emerged as a promising approach for neural network (NN) computations, which demand massive storage and matrix operations. In particular, the computation-in-memory (CiM) paradigm, based on a crossbar of resistive NVMs, seamlessly integrates storage and computation, addressing the memory wall issue in conventional architectures. Additionally, unlike CiM, digital NN accelerators, which use memory for storage, have also benefited from advancements in memory technologies by using them as on/off-chip NN weight storage. However, ensuring the reliability of neuromorphic systems is challenging, as resistive NVMs are prone to faults like manufacturing defects, non-idealities, and random telegraph noise, resulting in soft and hard errors and degrading the NN accuracy. Therefore, fault-tolerant mechanisms are crucial for reliable NN operation. This thesis explores hardware-efficient fault-tolerant techniques, which employ error-correcting codes in conjunction with architectural modifications to improve the reliability of NVMs and neuromorphic computing systems. In the context of a standalone memory, we focus on STT-MRAM. However, the proposed solutions can also be applied to other NVMs. In the case of neuromorphic computing applications, we introduce fault-tolerant techniques for resistive NVM-based crossbars used in CiM architectures and address the vulnerabilities in the weight memories of digital NN hardware accelerators.
18:30 CEST PhDF.54 IMPROVING THE EFFICIENCY AND SECURITY OF FULLY HOMOMORPHIC MACHINE LEARNING AS A SERVICE
Presenter:
Lars Folkerts, University of Delaware, US
Author:
Lars Folkerts, University of Delaware, US
Abstract
Machine learning services have become increasingly prevalent in users' daily lives, with both individuals and businesses integrating these technologies into a wide range of applications. From personalized recommendations on streaming platforms to advanced analytics in business operations, the benefits of machine learning (ML) are vast and undeniable. However, despite the widespread adoption of these services, privacy concerns have remained a significant barrier to further integration. Consider the Machine Learning as a Service (MLaaS) paradigm with an honest-but-curious cloud server. Here, users send their data to the cloud for processing, and the cloud sends the computation result back to the user. This enables them to offload computational costs and leverage the cloud service provider's proprietary statistical models, such as neural networks. However, a curious provider may access and exploit stored data, which could then be sold to advertisers or used to improve the model. Fully homomorphic encryption (FHE) offers a solution by enabling computation on encrypted data without revealing it. FHE allows users to send encrypted data to cloud providers for processing, who can execute algorithms on the ciphertext without revealing the underlying plaintext. After computation, the encrypted result is sent back to the user to decrypt. This approach allows cloud servers to maintain control over their model IP while protecting user data privacy. My thesis provides the foundational groundwork for developing the next generation of FHE-based privacy-preserving machine learning. The efficiency improvements include novel binary neural network speedup techniques (REDsec), generative AI algorithms (Tyche) and an FHE-based MLaaS protocol that supports authenticated data storage (Proteus). My research also evaluates the security of several encrypted ML architectures against side-channels, including user data privacy in multi-exit neural networks (FHE-MENNs) and single FHE-layer transformers (Testing Split Model LLMs). These works outline a promising future for secure and feasible private MLaaS computation.
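The round trip described above can be summarised in a few lines of Python. The ToyFHE class below is an insecure placeholder that only illustrates the data flow (who holds the keys, what crosses the network); its method names are invented for this sketch and do not correspond to any real FHE library API.

    class ToyFHE:
        """Insecure stand-in used only to show the MLaaS data flow; real FHE replaces these."""
        def keygen(self):
            return "pk", "sk"
        def encrypt(self, pk, x):
            return ("ct", x)                    # placeholder "ciphertext"
        def evaluate(self, model, ct):
            return ("ct", model(ct[1]))         # cloud computes on the ciphertext
        def decrypt(self, sk, ct):
            return ct[1]

    def mlaas_round_trip(x, model, fhe=None):
        fhe = fhe or ToyFHE()
        pk, sk = fhe.keygen()              # client generates keys and keeps sk
        ct_x = fhe.encrypt(pk, x)          # only ciphertext is sent to the cloud
        ct_y = fhe.evaluate(model, ct_x)   # server evaluates its proprietary model
        return fhe.decrypt(sk, ct_y)       # only the client can read the result

    print(mlaas_round_trip(3.0, lambda v: 2 * v + 1))   # -> 7.0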
18:30 CEST PhDF.55 OPTIMIZING CONVOLUTIONAL WEIGHT MAPPING FOR ENERGY-EFFICIENT IN-MEMORY CNN INFERENCE
Speaker:
Johnny Rhe, Sungkyunkwan University, KR
Authors:
Johnny Rhe and Jong Hwan Ko, Sungkyunkwan University, KR
Abstract
In-memory computing (IMC) architectures have emerged as one of the most viable options for faster and more power-efficient convolutional neural network (CNN) inference. The key challenge in IMC architectures is optimizing the mapping of convolutional weights onto memory arrays to enhance energy efficiency and reduce inference latency. Recent research has introduced mapping methods to facilitate convolution operations within IMC arrays. However, existing approaches often fail to optimize memory usage, as they do not account for variations in array and layer sizes. This limitation results in underutilized resources, increased energy consumption, and a large drop in inference accuracy. This thesis addresses these limitations by proposing a multi-level optimization approach, specifically focusing on array-level, row-level, and cell-level pruning techniques for energy-efficient weight mapping in IMC-based CNN inference.
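To make the under-utilisation problem concrete, the sketch below estimates how much of a fixed-size crossbar a convolutional layer actually occupies under a common im2col-style mapping (each flattened kernel stored in one column); the array dimensions and layer shapes are illustrative, not those used in the thesis.

    import math

    def crossbar_utilization(n_kernels, k, c_in, rows=256, cols=256):
        """Fraction of allocated crossbar cells that actually hold weights when each
        flattened k*k*c_in kernel is mapped onto crossbar columns (im2col-style)."""
        col_len = k * k * c_in                         # weights per kernel (one logical column)
        arrays_per_kernel = math.ceil(col_len / rows)  # tall kernels split across arrays
        cols_needed = n_kernels * arrays_per_kernel
        n_arrays = math.ceil(cols_needed / cols)
        used = n_kernels * col_len
        total = n_arrays * rows * cols
        return used / total

    # A small first layer wastes most of the array; a deep layer fits much better.
    print(f"{crossbar_utilization(64, 3, 3):.1%}")     # 3x3x3 kernels, 64 of them  -> ~2.6%
    print(f"{crossbar_utilization(256, 3, 128):.1%}")  # 3x3x128 kernels, 256 of them -> ~90%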
18:30 CEST PhDF.56 LEARN TO FLY: ENABLING DEEP LEARNING BASED PERCEPTION & CONTROL FOR AERIAL ROBOTS
Speaker and Author:
Veera Venkata Ram Murali Krishna Rao Muvva, University of Nebraska Lincoln, US
Abstract
Volants are extraordinary creatures. We can observe amazing phenomena in their flight, such as the precision of bald eagles in stormy conditions, the innate migration of Arctic terns without GNSS, the meticulous docking of hummingbirds, and the exceptional vision-based tracking of falcons. My career goal is to harness these extraordinary capabilities observed in volants, to overcome the current limitations in aerial systems. Volants achieve this intelligence not because they are experts in fluid dynamics, control systems, or computer vision; they succeed via a continuous learning process and the strong integration of perception and control modules. I believe that integrating traditional control theory and computer vision with machine and deep learning can offer solutions for designing and deploying robust aerial robots.
18:30 CEST PhDF.57 PERFORMANCE AND ENERGY EFFICIENT SECURE COMPUTING ON EDGE DEVICES
Speaker:
Ismet Dagli, Colorado School of Mines, US
Authors:
Ismet Dagli and Mehmet Belviranli, Colorado School of Mines, US
Abstract
As in-the-field computation demands increase, the use of more sophisticated heterogeneous System-on-Chips (SoCs) becomes more common in many edge devices. Advancing beyond monolithic single-processor architectures, SoCs have evolved to process a spectrum of computation through the integration of multiple domain-specific accelerators (DSAs). The architectural design choices, such as varying computation/power characteristics among these DSAs and CPUs, can further enhance the overall throughput and energy efficiency [17]. This approach enables collaborative execution, wherein tasks in a workload are dynamically mapped to the most efficient processing unit (PU) [15, 20, 10]. The total utilization of the system can be further improved by running tasks concurrently, e.g., layers in a deep neural network (DNN). Future computing systems are expected to scale both the number of accelerators, by embedding more processor diversity in a computing device [18, 16], and the number of computing devices, by connecting more devices and/or the cloud in domains such as federated learning and connected autonomous cars [14, 1]. Overall, efficient execution of modern edge/cloud workloads requires understanding their performance at both the SoC level and the system level, and should aim to improve four critical considerations: energy, latency, throughput, and security.
• Energy consumption is critical in autonomous and mobile domains, where the deployment of machine learning tasks, particularly DNNs, incurs significant power consumption. Our study, published in DAC'22, optimizes energy consumption and is detailed in Section 2.
• Minimizing computational latency is achievable by understanding performance bottlenecks at the SoC and system levels. Our work, published in PPoPP'24 and an SRC finalist at MICRO'22, optimizes latency and performance and is presented in Section 3.
• Security vulnerabilities arising from shared-memory attacks have become increasingly common as performance is optimized on resource-limited edge devices. Our work, accepted to DATE'25, investigates shared-memory vulnerabilities in Section 4.
• Improving the total system throughput necessitates an in-depth analysis of resource utilization at the SoC level and the system level. Our work, currently under submission to a top-tier conference and an SRC finalist at CGO'24, proposes holistic resource management in Section 5.
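As a minimal illustration of the collaborative-execution idea mentioned above (not the schedulers proposed in the cited works), the sketch below greedily maps each layer of a workload to the processing unit with the lowest energy-delay product, using a made-up per-layer profile table.

    # Hypothetical per-layer profiles: (latency_ms, energy_mJ) on each PU.
    profiles = {
        "conv1": {"CPU": (9.0, 40.0), "GPU": (2.0, 25.0), "NPU": (1.5, 8.0)},
        "lstm":  {"CPU": (6.0, 18.0), "GPU": (5.5, 45.0), "NPU": (7.0, 20.0)},
        "fc":    {"CPU": (1.0,  4.0), "GPU": (0.8,  6.0), "NPU": (0.6, 2.0)},
    }

    def greedy_mapping(profiles):
        """Pick, per layer, the PU minimising the energy-delay product (EDP)."""
        mapping = {}
        for layer, options in profiles.items():
            mapping[layer] = min(options, key=lambda pu: options[pu][0] * options[pu][1])
        return mapping

    print(greedy_mapping(profiles))   # -> {'conv1': 'NPU', 'lstm': 'CPU', 'fc': 'NPU'}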
18:30 CEST PhDF.58 ENHANCING QUANTUM CLOUD PERFORMANCE THROUGH ADVANCED TECHNIQUES
Presenter:
Tingting Li, Zhejiang University, CN
Authors:
Tingting Li1, Jianwei Yin2 and Liqiang Lu1
1Zhejiang University, CN; 2Zhejiang University, CN
Abstract
Quantum computing cloud services are composed of a complex ecosystem that integrates hardware, software, and network infrastructure to provide users with access to quantum computing resources over the internet. Quantum computing cloud services have emerged as a pivotal domain in the realm of computational technology, offering unprecedented computational capabilities that could revolutionize various industries. The concept refers to the provision of quantum computing resources over the cloud, allowing users to access and utilize quantum processors without the need for physical ownership. This thesis encompasses a series of optimizations from the hardware level to software services, aiming to enhance the efficiency and reliability of quantum cloud services. On the hardware side, we explore the use of Mixture of Experts (MoE) for the automatic calibration of superconducting quantum computers. On the software side, we investigate quantum serverless function orchestration for task allocation optimization. In terms of cloud service security, we explore quantum fingerprinting for cloud security using quantum task output. These efforts collectively contribute to the advancement of quantum cloud services, ensuring their robustness and security in the face of evolving computational demands.
18:30 CEST PhDF.59 EFFICIENT REFINEMENT OF HUMAN POSE ESTIMATION FOR INDUSTRY 5.0
Speaker:
Enrico Martini, Università di Verona, IT
Authors:
Enrico Martini and Nicola Bombieri, Università di Verona, IT
Abstract
This thesis addresses challenges in markerless Human Pose Estimation (HPE), including noise, occlusions, and computational constraints, by developing real-time filtering techniques that combine learned models with traditional methods. Key contributions include BeFine, a distributed 3D HPE industrial telemonitoring system that uses edge devices to capture multi-view poses and applies advanced filtering and clustering algorithms. In human-robot interaction, a filtering pipeline improves incomplete 3D poses from RGB-D cameras, mitigating occlusion effects and enabling collision prediction. This work enhances markerless motion capture, demonstrating its value in real-world applications.
18:30 CEST PhDF.60 LATTICE-BASED CRYPTOGRAPHY: BEYOND NIST STANDARDIZATION
Presenter:
Suparna Kundu, COSIC, KU Leuven, BE
Authors:
Suparna Kundu1, Ingrid Verbauwhede2 and Angshuman Karmakar3
1COSIC, KU Leuven, BE; 2KU Leuven, BE; 3IIT Kanpur, IN
Abstract
The National Institute of Standards and Technology (NIST) published the first set of post-quantum cryptographic standards in 2024. Although this is a significant step towards the transition from classical public-key cryptography (PKC) to post-quantum cryptography (PQC), several issues, such as new designs, lightweight implementations, physical attacks, and their countermeasures, need to be addressed before the widespread deployment of PQC in real-world applications. The primary focus of my thesis was to address some of these problems. My thesis bridged the gap between the theory and practice of PQC, especially lattice-based key-encapsulation mechanisms.

REC Reception

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 18:30 CEST - 20:00 CEST


Tuesday, 01 April 2025

ASD04 ASD regular session: Novel Safety Metrics, Adaptive Patterns for Resilience, and Legal Frameworks in Autonomous Systems Design

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

Session chair:
Dirk Ziegenbein, Robert Bosch GmbH, DE

Session co-chair:
Rolf Ernst, TU Braunschweig, DE

This session discusses key design aspects for safety, adaptability, and legality of autonomous systems. First, a framework that utilizes cross-channel safety performance indicators (SPIs) to identify and tackle hazardous driving scenarios for automated vehicles and corresponding evidence from a proof-of-concept implementation is presented. The session continues with the introduction of the Reflex pattern, an innovative approach inspired by biological reflexes that enhances system resilience by dynamically responding to fluctuating resources, demonstrated through a drone image processing scenario. Lastly, the role of legal considerations in the design of automated vehicles is explored, especially those intended to transport intoxicated individuals, underscoring the need for a multidisciplinary collaboration among management, marketing, engineering, and legal teams to ensure the development of functionally robust and legally sound systems.

Time Label Presentation Title
Authors
08:30 CEST ASD04.1 IDENTIFICATION OF HAZARDOUS DRIVING SCENARIOS USING CROSS-CHANNEL SAFETY PERFORMANCE INDICATORS
Speaker:
Caspar Hanselaar, Eindhoven University of Technology, NL
Authors:
Caspar Hanselaar1, Murali Manohar Selva Kumar2, Yuting Fu2, Andrei Terechko2, Ranga Rao Venkatesha Prasad3 and Emilia Silvas1
1Eindhoven University of Technology, NL; 2NXP, NL; 3TU Delft, NL
Abstract
Automated Driving (AD) vehicles are slowly being deployed on public roads. These AD vehicles will encounter hazardous (dangerous) scenarios due to unforeseen test cases at design time and changing environments on the road after deployment. To allow developers of AD systems to mitigate such unforeseen risks, the safety of AD vehicles needs to be continuously monitored after deployment. To this end, the UL4600 standard and AVSC guidelines recommend the use of safety performance indicators (SPIs) by AD vehicle developers. Our paper presents a framework which uses SPIs to identify potentially hazardous scenarios specific to the evaluated AD system, covering both AD vehicles and cloud operations. The framework uses the perception systems and motion plans of heterogeneous redundant multi-channel architectures to detect hazards invisible in single-channel-based systems. We propose three cross-channel SPIs and use them to identify hazardous scenarios in the AD vehicle and validate this approach with a proof-of-concept implementation. In a test of six challenging routes in the CARLA simulator, our framework automatically identifies 86% of hazardous situations. Next, it identifies contributing issues in the AD vehicle, such as missed object detections or dangerous planned trajectories. With this proof of concept, we show that this framework provides evidence on the safety of deployed systems, identifies AD vehicle functions in need of improvement and provides lessons for the development of future AD systems.
09:00 CEST ASD04.2 DESIGNING RESILIENT AUTONOMOUS SYSTEMS WITH THE REFLEX PATTERN
Speaker:
Julian Demicoli, TU Munich, DE
Authors:
Julian Demicoli and Sebastian Steinhorst, TU Munich, DE
Abstract
Autonomous systems face significant challenges due to fluctuating resources and unstable environments, where traditional redundancy strategies for resilience can be inefficient. We present the Reflex pattern, inspired by biological reflexes, promoting system resilience by dynamically adapting to changing resource conditions. By switching between complex and resource-efficient algorithms based on availability, the pattern optimizes resource utilization without extensive redundancy, ensuring essential functionalities remain operational under constraints. To facilitate adoption, we introduce ReflexLang, a domain-specific language (DSL) enabling automated code generation for reflex-pattern-based systems. We validate the pattern's effectiveness in a drone image processing scenario, demonstrating its potential to enhance operational integrity and resilience.
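A minimal sketch of the switching behaviour the Reflex pattern describes (not ReflexLang or its generated code): a heavyweight and a lightweight routine are registered for the same task, and the runtime picks one from the currently observed resource budget. The routine names and thresholds below are invented for illustration.

    def detect_full(frame):
        # accurate but expensive perception pipeline (placeholder)
        return "full-detection-result"

    def detect_lite(frame):
        # cheap fallback that keeps the essential function alive (placeholder)
        return "lite-detection-result"

    def reflex_step(frame, free_cpu, free_mem_mb, cpu_thresh=0.4, mem_thresh=256):
        """Select an algorithm from the currently observed resource budget,
        instead of provisioning full redundancy for the worst case."""
        if free_cpu >= cpu_thresh and free_mem_mb >= mem_thresh:
            return detect_full(frame)
        return detect_lite(frame)   # reflex: degrade gracefully, stay operational

    print(reflex_step(frame=None, free_cpu=0.2, free_mem_mb=512))   # -> lite path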
09:30 CEST ASD04.3 LAW AS A DESIGN CONSIDERATION FOR AUTOMATED VEHICLES SUITABLE TO TRANSPORT INTOXICATED PERSONS
Speaker:
William Widen, University of Miami, US
Authors:
Marilyn Wolf1 and William Widen2
1University of Nebraska, US; 2University of Miami, US
Abstract
This essay explains why an automated vehicle (AV) manufacturer should consider law during the design process for an AV intended as "fit-for-purpose" to transport intoxicated persons. It suggests that management, marketing, engineering, and legal functions collaborate to develop product requirements and specifications that shield owner/occupants from criminal liability for DUI manslaughter and negligent homicide, as well as guard against civil liability.

BPA04 BPA Session 4

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST BPA04.1 A LIGHTWEIGHT CNN FOR REAL-TIME PRE-IMPACT FALL DETECTION
Speaker:
Cristian Turetta, Università di Verona, IT
Authors:
Cristian Turetta1, Muhammed Ali1, Florenc Demrozi2 and Graziano Pravadelli1
1Università di Verona, IT; 2Department of Electrical Engineering and Computer Science, University of Stavanger, NO
Abstract
Falls can have significant and far-reaching effects on various groups, particularly the elderly, workers, and the general population. These effects can impact both physical and psychological well-being, leading to long-term health problems, reduced productivity, and a decreased quality of life. Numerous fall detection systems have been developed to prompt first aid in the event of a fall and reduce its impact on people's lives. However, detecting a fall after it has occurred is insufficient to mitigate its consequences, such as trauma. These effects can be further minimized by activating safety systems (e.g., wearable airbags) during the fall itself, specifically in the pre-impact phase, to reduce the severity of the impact when hitting the ground. Achieving this, however, requires recognizing the fall early enough to provide the necessary time for the safety system to become fully operational before impact. To address this challenge, this paper introduces a novel lightweight convolutional neural network (CNN) designed to detect pre-impact falls. The proposed model overcomes the limitations of current solutions regarding deployability on resource-constrained embedded devices, specifically for controlling the inflation of an airbag jacket. We extensively tested and compared our model, deployed on an STM32F722 microcontroller, against state-of-the-art approaches using two different datasets.
08:50 CEST BPA04.2 COCKTAIL: CHUNK-ADAPTIVE MIXED-PRECISION QUANTIZATION FOR LONG-CONTEXT LLM INFERENCE
Speaker:
Wei Tao, Huazhong University of Science and Technology, CN
Authors:
Wei Tao1, Bin Zhang1, Xiaoyang Qu2, Jiguang Wan1 and Jianzong Wang3
1Huazhong University of Science and Technology, CN; 2Ping An Technology (Shenzhen) Co., Ltd, CN; 3Ping An Technology, CN
Abstract
Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerable inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) cache in LLMs based on token granularity, which is time-consuming in the search process and hardware inefficient during computation. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search determines the optimal bitwidth configuration of the KV cache chunks quickly based on the similarity scores between the corresponding context chunks and the query, maintaining the model accuracy. Furthermore, chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets. Our code is available at https://github.com/Sullivan12138/Cocktail.
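A rough sketch of the chunk-level bitwidth selection idea described above (illustrative thresholds and scoring, not the actual Cocktail search): chunks whose context is more similar to the current query keep higher precision.

    import numpy as np

    def chunk_bitwidths(chunk_embs, query_emb, hi_bits=8, lo_bits=2, keep_top=0.25):
        """Assign a KV-cache bitwidth per chunk from its similarity to the query.
        The top `keep_top` fraction of chunks keeps hi_bits; the rest gets lo_bits."""
        sims = chunk_embs @ query_emb / (
            np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
        n_hi = max(1, int(len(sims) * keep_top))
        hi_idx = np.argsort(sims)[-n_hi:]
        bits = np.full(len(sims), lo_bits)
        bits[hi_idx] = hi_bits
        return bits

    # Example: 8 context chunks with 64-dimensional summary embeddings.
    rng = np.random.default_rng(0)
    print(chunk_bitwidths(rng.normal(size=(8, 64)), rng.normal(size=64)))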
09:10 CEST BPA04.3 RANKMAP: PRIORITY-AWARE MULTI-DNN MANAGER FOR HETEROGENEOUS EMBEDDED DEVICES
Speaker:
Iraklis Anagnostopoulos, Southern Illinois University Carbondale, US
Authors:
Andreas Karatzas1, Dimitrios Stamoulis2 and Iraklis Anagnostopoulos1
1Southern Illinois University Carbondale, US; 2The University of Texas at Austin, US
Abstract
Modern edge data centers simultaneously handle multiple Deep Neural Networks (DNNs), leading to significant challenges in workload management. Thus, current management systems need to leverage the architectural heterogeneity of new embedded systems, enabling efficient handling of multi-DNN workloads. This paper introduces RankMap, a priority-aware manager specifically designed for multi-DNN tasks on heterogeneous embedded devices. RankMap addresses the extensive solution space of multi-DNN mapping through stochastic space exploration combined with a performance estimator. Experimental results show that RankMap achieves 3.6× higher average throughput compared to existing methods, while effectively preventing DNN starvation under heavy workloads and improving the prioritization of specified DNNs by 57.5×.

ET02 Securing the Future: Designing Built-in-Security Enabled Photonic AI Chip

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST


FS05 Focus Session - 3D Integration, Cryogenic Circuits and Superconducting Logic: Emerging Trends Shaping the Future of High-Performance Computing

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

Session chair:
Ahmedullah Aziz, University of Tennessee Knoxville, US

Session co-chair:
Hussam Amrouch, TU Munich, DE

Organiser:
Hussam Amrouch, TU Munich, DE

As CMOS scaling approaches its fundamental limits, the explosive rise of artificial intelligence (AI) and large language models (LLMs) is exposing profound challenges in today’s computing architectures. The immense demand for memory, speed, and energy efficiency is pushing classical chips to their breaking point. This focus session will explore three transformative trends that are poised to redefine the future of high-performance computing and address the escalating challenges of AI-driven workloads. The first trend is 3D integration, an innovative paradigm that allows memory layers to be fabricated in the back end of line (BEOL), dramatically increasing on-chip memory capacity. Of the emerging memory technologies, ferroelectric memories stand out as particularly promising due to their compatibility with BEOL CMOS and their low-power, high-density operation. The second trend, cryogenic CMOS, leverages the advantages of operating circuits at cryogenic temperatures (77K and below), significantly enhancing transistor performance with steeper subthreshold slopes, higher on-currents, and lower off-currents, delivering remarkable speed and efficiency gains. The last trend is superconducting logic, which is set to revolutionize computing by achieving zero resistance, unlocking unprecedented levels of speed and energy efficiency. Our session brings together leading experts from both industry and academia to present cutting-edge solutions that are already reshaping the semiconductor landscape. Attendees will gain a deep, comprehensive understanding of the emerging trends that will drive the next generation of high-performance computing, providing a critical window into the chips that will fuel future advances in AI.

Time Label Presentation Title
Authors
08:30 CEST FS05.1 PUSHING THE BOUNDARIES OF AI CHIPS: FROM MONOLITHIC 3D CMOS TO CRYOGENIC COMPUTING
Speaker:
Hussam Amrouch, TU Munich (TUM), DE
Authors:
Mahdi Benkhelifa1, Shivendra Parihar2, Anirban Kar1, Girish Pahwa3, Yogesh Chauhan4 and Hussam Amrouch5
1TU Munich, DE; 2University of California, Berkeley, US; 3National Yang Ming Chiao Tung University, TW; 4IIT Kanpur, IN; 5TU Munich (TUM), DE
Abstract
As CMOS scaling approaches its fundamental limits, the explosive rise of AI and LLMs has unveiled profound bottlenecks in computing architectures. This talk presents two groundbreaking paradigms poised to reshape the landscape of high-performance computing and meet the surging demands of AI-driven workloads. The first paradigm is 3D monolithic integration, a revolutionary approach that achieves unprecedented logic density through Complementary FETs (CFETs), where pMOS and nMOS transistors are vertically stacked, and a dramatic expansion of on-chip memory capacity by integrating memory layers atop logic transistors. The second paradigm leverages the transformative potential of operating chips at cryogenic temperatures—specifically around 77K—where transistors exhibit significantly enhanced performance, and parasitic resistances are substantially minimized. These advancements hold the promise of redefining computing efficiency and performance for the AI era.
08:53 CEST FS05.2 TRANSISTOR AGING AND CIRCUIT RELIABILITY AT CRYOGENIC TEMPERATURES
Speaker:
Vishal Nayar, imec, BE
Authors:
Javier Fortuny and Vishal Nayar, IMEC, BE
Abstract
The increasing interest in cryogenic circuits is driven by their transformative potential across high-performance computing, medical devices, space exploration, and quantum technologies. Operating transistors at cryogenic temperatures, such as 77 K and below, yields substantial improvements, including increased ON current, reduced OFF current, and enhanced sub-threshold slope. While recent studies have explored device-level reliability at cryogenic temperatures, circuit-level reliability—particularly under bias temperature instability (BTI)—remains underexamined, leaving critical aging mechanisms at these temperatures not well understood. To bridge this gap, we designed and fabricated a customized chip in a commercial HKMG 28 nm technology. The chip integrates several ring oscillator (RO) circuits for precise characterization of accelerated aging effects, enabling evaluation of their impact on performance at cryogenic temperatures. Finally, we project technology degradation over a 10-year horizon, comparing the wear-out achieved at room temperature (298 K) and at 77 K when operating circuits at the nominal voltage, revealing the significant mitigation of BTI aging at affordable cryogenic temperatures.
09:15 CEST FS05.3 FERROELECTRIC-SUPERCONDUCTING SYNERGY FOR FUTURE COMPUTING
Speaker:
Ahmedullah Aziz, University of Tennessee Knoxville, US
Authors:
Shamiul Alam1 and Ahmedullah Aziz2
1University of Tennessee Knoxville, US; 2University of Tennessee, Knoxville, US
Abstract
Ferroelectric Superconducting Quantum Interference Devices (Fe-SQUIDs) have recently gained attention as a transformative technology for superconducting computing, offering voltage-controlled switching that is essential for large-scale digital circuits. This unique technology has the potential to drive advancements in cryogenic computing by enabling scalable memory systems and voltage-controlled logic circuits. These innovations are critical for the realization of large-scale quantum computers and hold significant promise for high-performance computing and space exploration. In this article, we explore how Fe-SQUIDs, integrated with heater cryotrons (hTrons), can be harnessed to develop key components of computing systems. These include non-volatile memory, voltage-controlled logic circuits, in-memory matrix-vector multiplication systems, and ternary content-addressable memory. We also examine how changes in the key characteristics of Fe-SQUIDs and hTrons influence the performance of these applications, providing insights into the design and optimization of next-generation superconducting hardware.
09:38 CEST FS05.4 MATERIAL-TO-SYSTEM CO-OPTIMIZATION FOR ADVANCED SEMICONDUCTOR MANUFACTURING
Presenter:
Gaurav Thareja, Applied Materials, US
Author:
Gaurav Thareja, Applied Materials, US
Abstract
The exponential growth of AI is tied to groundbreaking advancements in semiconductor technology, driven by the PPACt metrics: low Power, high Performance, reduced Area, low Cost, and faster Time to market. Traditionally, achieving these metrics has required years—often decades—of meticulous semiconductor innovation, progressing from concept to high-volume manufacturing. This process unfolds through four critical phases: materials discovery, process optimization, device engineering, and chip design. In this talk, we will explore how ML-driven methods are revolutionizing the semiconductor industry, significantly accelerating progress across all stages of development. We will highlight the key discoveries necessary for enabling novel materials that power cryogenic circuits, ferroelectric memories, and 3D integration.

HSD01 HackTheSilicon DATE

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 12:30 CEST


SD01 Special Day on AI and ML Trends

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

This Special Day focuses on exploring the latest trends and innovations in Artificial Intelligence (AI) and Machine Learning (ML) in the context of DATE. As AI (and mainly generative AI) is booming, especially since the release of ChatGPT, we expect AI/ML to change the way we approach Design, Automation, and Test. In this context, field experts will present their thoughts on the challenges and opportunities of AI/ML, and will engage the audience in an open discussion about the trends that the DATE community should pursue.

This Special Day will highlight the following topics:
* Design of hardware architectures and software, including automatic exploration of large design spaces, assistance of the human designer, resource selection and optimization
* Verification of hardware architectures, with topics such as performance prediction, (formal) design validation, accelerating simulations thanks to AI-Augmented Surrogate Models
* AI-Accelerated Physical Design and Validation of layout and floorplans
* New AI accelerators architectures

These topics will be addressed by a lineup of six distinguished speakers, experts in their respective fields. The day will conclude with a panel discussion allowing experts and the audience to engage in an informal exchange of ideas and trigger discussions on the future research directions and/or the interaction between the various domains presented during the day.

This Special Day is the ideal event for AI/ML researchers, data scientists, hardware designers, software developers, sustainability advocates, and anyone interested in the future directions of AI and ML for Design, Automation and Test.

Time Label Presentation Title
Authors
08:30 CEST SD01.1 INTRODUCTION TO THE SPECIAL DAY
Presenter:
Ana Lucia Varbanescu, University of Amsterdam, NL
Authors:
Ana Lucia Varbanescu1 and Marc Duranton2
1University of Amsterdam, NL; 2CEA, FR
Abstract
.
08:45 CEST SD01.2 TBD
Presenter:
David Z. Pan, The University of Texas at Austin, US
Author:
David Z. Pan, The University of Texas at Austin, US
Abstract
.
09:15 CEST SD01.3 TBD
Presenter:
Tobias Becker, Groq, DE
Author:
Tobias Becker, Groq, DE
Abstract
.
09:45 CEST SD01.4 TBD
Presenter:
Adolfy Hoisie, Brookhaven National Laboratory, US
Author:
Adolfy Hoisie, Brookhaven National Laboratory, US
Abstract
.

TS06 Design Automation for Quantum Computing

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST TS06.1 EMPOWERING QUANTUM ERROR TRACEABILITY WITH MOE FOR AUTOMATIC CALIBRATION
Speaker:
Tingting Li, Zhejiang University, CN
Authors:
Tingting Li1, Ziming Zhao1, Liqiang Lu1, Siwei Tan2 and Jianwei Yin1
1Zhejiang University, CN; 2Zhejiang University, CN
Abstract
Quantum computing offers the potential for exponential speedups over classical computing in tackling complex tasks, such as large-number factorization and chemical molecular simulation. However, quantum noise remains a significant challenge, hindering the reliability and scalability of quantum systems. Therefore, effective characterization and calibration of quantum noise are critical to advancing these systems. Quantum calibration is a process that heavily relies on expert knowledge, and a range of current research focuses on automatic calibration. However, traditional calibration methods often lack an effective error-traceback mechanism, leading to repeated calibration attempts without identifying root causes. To address the issue of error traceback in calibration failures, this paper proposes an automatic calibration error traceback algorithm facilitated by a Mixture of Experts (MoE) system inspired by current large language model technologies. Our approach enables traceability of quantum calibration errors, allowing for the rapid identification and correction of deviations from the calibration state. Extensive experimental results demonstrate that the MoE-based automatic calibration method significantly outperforms traditional techniques in error traceability and calibration efficiency. Notably, our approach improved the average visibility of 77 qubits by 25.5%, surpassing the outcomes of fixed calibration processes. This work presents a promising path toward more reliable and scalable quantum computing systems.
08:35 CEST TS06.2 OPTIMAL STATE PREPARATION FOR LOGICAL ARRAYS ON ZONED NEUTRAL ATOM QUANTUM COMPUTERS
Speaker:
Yannick Stade, TU Munich, DE
Authors:
Yannick Stade, Ludwig Schmid, Lukas Burgholzer and Robert Wille, TU Munich, DE
Abstract
Quantum computing promises to solve problems previously deemed infeasible. However, high error rates necessitate quantum error correction for practical applications. Seminal experiments with zoned neutral atom architectures have shown remarkable potential for fault-tolerant quantum computing. To fully harness their potential, efficient software solutions are vital. A key aspect of quantum error correction is the initialization of physical qubits representing a logical qubit in a highly entangled state. This process, known as state preparation, is the foundation of most quantum error correction codes and, hence, a crucial step towards fault-tolerant quantum computing. Generating a schedule of target-specific instructions to perform the state preparation is highly complex. First software tools exist but are not suitable for the zoned neutral atom architectures. This work addresses this gap by leveraging the computational power of SMT solvers and generating minimal schedules for the state preparation of logical arrays. Experimental evaluations demonstrate that actively utilizing zones to shield idling qubits consistently results in higher fidelities than solutions disregarding these zones. The complete code is publicly available in open-source as part of the Munich Quantum Toolkit (MQT) at https://github.com/cda-tum/mqt-qmap.
08:40 CEST TS06.3 DESIGN OF AN FPGA-BASED NEUTRAL ATOM REARRANGEMENT ACCELERATOR FOR QUANTUM COMPUTING
Speaker:
Xiaorang Guo, TU Munich, DE
Authors:
Xiaorang Guo, Jonas Winklmann, Dirk Stober, Amr Elsharkawy and Martin Schulz, TU Munich, DE
Abstract
Neutral atoms have emerged as a promising technology for implementing quantum computers due to their scalability and long coherence times. However, the execution frequency of neutral atom quantum computers is constrained by image processing procedures, particularly the assembly of defect-free atom arrays, which is a crucial step in preparing qubits (atoms) for execution. To optimize this assembly process, we propose a novel quadrant-based rearrangement algorithm that employs a divide-and-conquer strategy and also enables the simultaneous movement of multiple atoms, even across different columns and rows. We implement the algorithm on Field Programmable Gate Arrays (FPGAs) to handle each quadrant independently (hardware-level optimization) while maximizing parallelization. To the best of our knowledge, this is the first hardware acceleration work for atom rearrangement, and it significantly reduces processing time. This achievement also contributes to the ongoing efforts of tightly integrating quantum accelerators into High-Performance Computing (HPC) systems. Tested on a Zynq RFSoC FPGA at 250 MHz, our hardware implementation is able to complete the rearrangement process of a 30×30 compact target array, derived from a 50×50 initial loaded array, in approximately 1.0 μs. Compared to a comparable CPU implementation and to state-of-the-art FPGA work, we achieved about 54× and 300× speedups in the rearrangement analysis time, respectively. Additionally, the FPGA-based acceleration demonstrates good scalability, allowing for seamless adaptation to varying sizes of the atom array, which makes this algorithm a promising solution for large-scale quantum systems.
08:45 CEST TS06.4 IMAGE COMPUTATION FOR QUANTUM TRANSITION SYSTEMS
Speaker:
Xin Hong, Institute of Software, Chinese Academy of Sciences, CN
Authors:
Xin Hong1, Dingchao Gao1, Sanjiang Li2, Shenggang Ying1 and Mingsheng Ying3
1Institute of Software, Chinese Academy of Sciences, CN; 2UTS, AU; 3University of Technology Sydney, AU
Abstract
With the rapid progress in quantum hardware and software, the need for verification of quantum systems becomes increasingly crucial. While model checking is a dominant and very successful technique for verifying classical systems, its application to quantum systems is still an underdeveloped research area. This paper advances the development of model checking quantum systems by providing efficient image computation algorithms for quantum transition systems, which play a fundamental role in model checking. In our approach, we represent quantum circuits as tensor networks and design algorithms by leveraging the properties of tensor networks and tensor decision diagrams. Our experiments demonstrate that our contraction partition-based algorithm can greatly improve the efficiency of image computation for quantum transition systems.
08:50 CEST TS06.5 LOW-LATENCY DIGITAL FEEDBACK FOR STOCHASTIC QUANTUM CALIBRATION USING CRYOGENIC CMOS
Speaker:
Nathan Miller, Georgia Tech, US
Authors:
Nathan Miller, Laith Shamieh and Saibal Mukhopadhyay, Georgia Tech, US
Abstract
In order to develop quantum computing systems towards practically useful applications, their physical quantum bits (qubits) must be able to operate with minimal error. Recent work has demonstrated stochastic gate calibration protocols for quantum systems which are meant to track drifting control parameters and tune gate operations to high fidelity. These protocols critically rely on low-latency feedback between the quantum system and its classical control hardware, which is impossible without on-board classical compute from FPGAs or ASICs. In this work, we analyze the performance of a single-shot stochastic calibration protocol for indefinite outcome quantum circuits under various latency conditions based on timing considerations from experimental quantum systems. We also demonstrate the benefits that can be achieved with ASIC implementation of the protocol by synthesizing the classical control logic in a 28 nm CMOS design node, with simulations extended to 14 nm FinFET and at both room and cryogenic temperatures. We show that these classes of quantum calibration protocols can be easily implemented within contemporary control system architectures for low-latency performance without significant power or resource utilization, allowing for the rapid tuning and drift control of any gate-model quantum system towards fault-tolerant computation.
08:55 CEST TS06.6 IMPROVING FIGURES OF MERIT FOR QUANTUM CIRCUIT COMPILATION
Speaker:
Patrick Hopf, TU Munich, DE
Authors:
Patrick Hopf1, Nils Quetschlich1, Laura Schulz2 and Robert Wille1
1TU Munich, DE; 2Leibniz Supercomputing Centre, DE
Abstract
Quantum computing is an emerging technology that has seen significant software and hardware improvements in recent years. Executing a quantum program requires the compilation of its quantum circuit for a target Quantum Processing Unit (QPU). Various methods for qubit mapping, gate synthesis, and optimization of quantum circuits have been proposed and implemented in compilers. These compilers try to generate a quantum circuit that leads to the best execution quality - a criterion that is usually approximated by figures of merit such as the number of (two-qubit) gates, the circuit depth, expected fidelity, or estimated success probability. However, it is often unclear how well these figures of merit represent the actual execution quality on a QPU. In this work, we investigate the correlation between established figures of merit and actual execution quality on real machines - revealing that the correlation is weaker than anticipated and that more complex figures of merit are not necessarily more accurate. Motivated by this finding, we propose an improved figure of merit (based on a machine learning approach) that can be used to predict the expected execution quality of a quantum circuit for a chosen QPU without actually executing it. The employed machine learning model reveals the influence of various circuit features on generating high correlation scores. The proposed figure of merit demonstrates a strong correlation and outperforms all previous ones in a case study - achieving an average correlation improvement of 49%.
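For reference, one of the established figures of merit mentioned above, the estimated success probability (ESP), is commonly computed from the error rates reported in a QPU's calibration data, roughly as

    \mathrm{ESP} \approx \prod_{g \in \text{gates}} \bigl(1 - \epsilon_g\bigr) \cdot \prod_{q \in \text{measured qubits}} \bigl(1 - \epsilon_q^{\mathrm{ro}}\bigr)

where \epsilon_g is the reported error rate of gate g and \epsilon_q^{\mathrm{ro}} the readout error of qubit q; the paper's finding is that such calibration-data products track measured execution quality less closely than expected.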
09:00 CEST TS06.7 DETERMINISTIC FAULT-TOLERANT STATE PREPARATION FOR NEAR-TERM QUANTUM ERROR CORRECTION: AUTOMATIC SYNTHESIS USING BOOLEAN SATISFIABILITY
Speaker:
Ludwig Schmid, TU Munich, DE
Authors:
Ludwig Schmid1, Tom Peham1, Lucas Berent1, Markus Müller2 and Robert Wille1
1TU Munich, DE; 2RWTH Aachen University, DE
Abstract
To ensure resilience against the unavoidable noise in quantum computers, quantum information needs to be encoded using an error-correcting code, and circuits must have a particular structure to be fault-tolerant. Compilation of fault-tolerant quantum circuits is thus inherently different from the non-fault-tolerant case. However, automated fault-tolerant compilation methods are widely underexplored, and most known constructions are obtained manually for specific codes only. In this work, we focus on the problem of automatically synthesizing fault-tolerant circuits for the deterministic initialization of an encoded state for a broad class of quantum codes that are realizable on current and near-term hardware. To this end, we utilize methods based on techniques from classical circuit design, such as satisfiability solving, resulting in tools for the synthesis of (optimal) fault-tolerant state preparation circuits for near-term quantum codes. We demonstrate the correct fault-tolerant behavior of the synthesized circuits using circuit-level noise simulations. We provide all routines as open-source software as part of [retracted for double-blind review] for general use and to foster research in fault-tolerant circuit synthesis.
09:05 CEST TS06.8 OPTIMIZING QUBIT ASSIGNMENT IN MODULAR QUANTUM SYSTEMS VIA ATTENTION-BASED DEEP REINFORCEMENT LEARNING
Speaker:
Enrico Russo, University of Catania, IT
Authors:
Enrico Russo, Maurizio Palesi, Davide Patti, Giuseppe Ascia and Vincenzo Catania, University of Catania, IT
Abstract
Modular, distributed, and multi-core architectures are considered a promising solution for scaling quantum computing systems. Optimising communication is crucial to preserve quantum coherence. The compilation and mapping of quantum circuits should minimise state transfers while adhering to architectural constraints. To address this problem efficiently, we propose a novel approach using Reinforcement Learning (RL) to learn heuristics for a specific multi-core architecture. Our RL agent uses a Transformer encoder and Graph Neural Networks, encoding quantum circuits with self-attention and producing outputs via an attention-based pointer mechanism to match logical qubits with physical cores efficiently. Experimental results show that our method outperforms the baseline, reducing inter-core communications by 28% for random circuits while minimising time-to-solution.
09:10 CEST TS06.9 NEURAL CIRCUIT PARAMETER PREDICTION FOR EFFICIENT QUANTUM DATA LOADING
Speaker:
Dohun Kim, Pohang University of Science and Technology, KR
Authors:
Dohun Kim, Sunghye Park and Seokhyeong Kang, Pohang University of Science and Technology, KR
Abstract
Quantum machine learning (QML) has demonstrated the potential to outperform classical machine learning algorithms in various fields. However, encoding classical data into quantum states, known as quantum data loading, remains a challenge. Existing methods achieve high accuracy in loading a single data item, but lack efficiency for large-scale data loading tasks. In this work, we propose Neural Circuit Parameter Prediction, a novel method that leverages classical deep neural networks to predict the parameters of parameterized quantum circuits directly from the input data. This approach benefits from the batch inference capability of neural networks and improves the accuracy of quantum data loading. We introduce real-valued parameterization of quantum circuits and a three-phase training strategy to further enhance training efficiency and accuracy. Experimental results on the MNIST dataset show that our method achieves a 17.31% improvement in infidelity score and a 108-times faster runtime compared to existing methods. Our approach provides an efficient solution for quantum data loading, enabling the practical deployment of QML algorithms on large-scale datasets.
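The infidelity score referenced above is, in its usual definition, one minus the overlap between the target state and the state produced by the parameterized circuit U(\theta) acting on the all-zero state:

    \text{infidelity}(\theta) \;=\; 1 - \bigl|\langle \psi_{\text{target}} \,\vert\, U(\theta) \,\vert 0 \rangle^{\otimes n} \bigr|^2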
09:15 CEST TS06.10 CIM-BASED PARALLEL FULLY FFNN SURFACE CODE HIGH-LEVEL DECODER FOR QUANTUM ERROR CORRECTION
Speaker:
Hao Wang, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Hao Wang1, Erjia Xiao1, Songhuan He2, Zhongyi Ni1, Lingfeng Zhang1, Xiaokun Zhan3, Yifei Cui2, Jinguo Liu1, Cheng Wang2, Zhongrui Wang4 and Renjing Xu1
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2University of Electronic Science and Technology of China, CN; 3Harbin Institute of Technology, CN; 4Southern University of Science and Technology, CN
Abstract
Among all types of surface code decoders, fully neural network-based high-level decoders offer decoding thresholds that surpass the Minimum Weight Perfect Matching (MWPM) decoder, and exhibit strong scalability, making them one of the ideal solutions for addressing surface code challenges. However, current fully neural network-based high-level decoders can only operate serially and do not meet the current latency requirements (below 440 ns). To address these challenges, we first propose a parallel fully feedforward neural network (FFNN) high-level surface code decoder, and comprehensively measure its decoding performance on a computing-in-memory (CIM) hardware simulation platform. With the currently available hardware specifications, our work achieves a decoding threshold of 14.22%, and achieves high pseudo-thresholds of 10.4%, 11.3%, 12%, and 11.6% with decoding latencies of 197.03 ns, 234.87 ns, 243.73 ns, and 251.65 ns for distances of 3, 5, 7 and 9, respectively.

TS07 Applications of emerging technologies

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST TS07.1 HYPERDYN: DYNAMIC DIMENSIONAL MASKING FOR EFFICIENT HYPER-DIMENSIONAL COMPUTING
Speaker:
Fangxin Liu, Shanghai Jiao Tong University, CN
Authors:
Fangxin Liu, Haomin Li, Zongwu Wang, Dongxu Lyu and Li Jiang, Shanghai Jiao Tong University, CN
Abstract
Hyper-dimensional computing (HDC) is a bio-inspired computing paradigm that mimics cognitive tasks by encoding data into high-dimensional vectors and employing non-complex learning techniques. However, existing HDC solutions face a major challenge hindering their deployment on low-power embedded devices: the costly associative search module, especially in high-precision computations. This module involves calculating the distance between class vectors and query vectors, as well as sorting the distances. In this paper, we present HyperDyn, an efficient dynamic inference framework designed for accurate and efficient hyper-dimensional computing. Our framework first performs an offline analysis of the importance of different dimensions in the associative memory, based on the contributions of the dimensions to the classification accuracy. In addition, we introduce a dynamic dimensional importance scaling mechanism for more flexible and accurate dimension contribution judgments. Finally, HyperDyn achieves efficient dynamic associative search through a dimension masking mechanism that adapts to the characteristics of the input sample. We evaluate HyperDyn on datasets from three different fields, and the results show that HyperDyn can achieve a 7.65× speedup and 58% energy savings, with less than 0.2% loss in accuracy.
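A toy version of the associative search that HyperDyn accelerates, with a static random mask standing in for the paper's input-adaptive dimension masking; the vector sizes, the bipolar encoding, and the masking rule are illustrative only.

    import numpy as np

    def masked_associative_search(class_vecs, query, mask):
        """Return the class whose hypervector is closest to the query,
        computing cosine similarity only over the unmasked dimensions."""
        cv = class_vecs[:, mask]
        q = query[mask]
        sims = cv @ q / (np.linalg.norm(cv, axis=1) * np.linalg.norm(q) + 1e-9)
        return int(np.argmax(sims))

    D = 10_000                                     # hyper-dimensional vector length
    rng = np.random.default_rng(1)
    classes = rng.choice([-1, 1], size=(10, D))    # 10 bipolar class hypervectors
    query = classes[3] * rng.choice([1, 1, 1, -1], size=D)   # noisy copy of class 3
    mask = rng.random(D) < 0.3                     # keep ~30% of dimensions
    print(masked_associative_search(classes, query, mask))   # -> 3 (recovers the class)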
08:35 CEST TS07.2 C3CIM: CONSTANT COLUMN CURRENT MEMRISTOR-BASED COMPUTATION-IN-MEMORY MICRO-ARCHITECTURE
Speaker:
Yashvardhan Biyani, TU Delft, NL
Authors:
Yashvardhan Biyani, Rajendra Bishnoi, Said Hamdioui and Theofilos Spyrou, TU Delft, NL
Abstract
Advancements in Artificial Intelligence (AI) and Internet-of-Things (IoT) have increased demand for edge AI, but deployment on traditional AI accelerators, like GPUs and TPUs based on the von Neumann architecture, suffers from inefficiencies due to separate memory and compute units. Computation-in-Memory (CIM), utilizing non-volatile memristor devices to leverage analog computing principles and perform in-place computations, holds great potential in improving computational efficiency by eliminating frequent data movement. However, standard implementations of CIM face several challenges, primarily high power consumption and the non-linearity it induces, calling their viability for edge devices into question. In this paper, we propose C3CIM, a novel memristor-based CIM micro-architecture, featuring a new bit-cell and array design, targeting efficient implementation of Neural Networks (NN). Our architecture uses a constant current source to perform Multiply-and-Accumulate (MAC) operations with a very low computation current (10 to 100 nA), thereby significantly enhancing power efficiency. We adapted C3CIM for Spiking Neural Networks (SNN) and developed a prototype using the TSMC 40nm CMOS node for on-silicon validation. Furthermore, our micro-architecture was benchmarked using two SNN models based on the N-MNIST and IBM-Gesture datasets, for comparison against the current state-of-the-art (SOTA). Results show up to a 35x reduction in power along with a 6.7x saving in energy compared to SOTA, demonstrating the promising potential of this work for edge AI applications.
08:40 CEST TS07.3 ASNPC: AN AUTOMATED GENERATION FRAMEWORK FOR SNN AND NEUROMORPHIC PROCESSOR CO-DESIGN
Speaker:
Xiangyu Wang, National University of Defense Technology, CN
Authors:
Xiangyu Wang1, Yuan Li2, Zhijie Yang3, Chao Xiao1, Xun Xiao1, Renzhi Chen4, Weixia Xu1 and Lei Wang3
1National University of Defense Technology, CN; 2College of Computer, National University of Defense Technology, CN; 3Academy of Military Sciences, CN; 4qiyuan laboratory, CN
Abstract
Spiking neural networks (SNNs) are promisingly considered as energy-efficient alternatives to traditional deep neural networks. At the same time, neuromorphic processors have garnered increasing attention to support the efficient execution of SNNs. However, current works always separate their design to primarily prioritize a single criterion. Hardware-algorithm co-design allows for the simultaneous consideration of hardware and algorithm characteristics during the design process, effectively reducing resource usage while optimizing the algorithm's performance. In light of this, we developed a hardware-algorithm co-design framework named ASNPC for SNNs and neuromorphic processors. Considering the vast mixed-variable co-design space and the time-expensive function evaluations, we employed the surrogate-based multi-objective optimization algorithm MOTPE to identify Pareto solutions that balance algorithm performance and hardware costs. To rapidly obtain hardware results, we designed an end-to-end methodology that can automatically generate the Register-Transfer Level (RTL) code for neuromorphic processors corresponding to each candidate using templates from the hardware library. The evaluated hardware metrics, such as hardware resource and power consumption, are then fed back to MOTPE for the next candidate selection. Compared to existing works, the proposed approach exhibits the ability to find better Pareto solutions, balancing hardware costs and accuracy within a limited search budget, making it widely applicable to various application scenarios. Additionally, under the same hardware configuration, the neuromorphic processor we generated achieves lower hardware resource usage and higher throughput.
08:45 CEST TS07.4 SIMULTANEOUS DENOISING AND COMPRESSION FOR DVS WITH PARTITIONED CACHE-LIKE SPATIOTEMPORAL FILTER
Speaker:
Qinghang Zhao, Xidian University, CN
Authors:
Qinghang Zhao, Yixi Ji, Jiaqi Wang, Jinjian Wu and Guangming Shi, Xidian University, CN
Abstract
Dynamic vision sensor (DVS) is a novel neuromorphic imaging device that asynchronously generates event data corresponding to changes in light intensity at each pixel. However, the differential imaging paradigm of DVS renders it highly sensitive to background noise. Additionally, the substantial volume of event data produced in a very short time presents significant challenges for data transmission and processing. In this work, we present a novel spatiotemporal filter design, named PCLF, to achieve simultaneous denoising and compression for the first time. The PCLF employs a hierarchical memory structure that utilizes symmetric multi-bank cache-like row and column memories to store event data from a partitioned pixel array, which exhibits low memory complexity of O(m+n) for an m×n DVS. Furthermore, we propose a probability-based criterion to effectively control the compression ratio. We have implemented our design on an FPGA, demonstrating capabilities for real-time operation (≤60 ns) and low power consumption (<200 mW). Extensive experiments conducted on real-world DVS data across various tasks indicate that our design enables a reduction of event data by 30% to 68%, while maintaining or even enhancing the performance of the tasks.
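To make the spatiotemporal filtering idea concrete for readers outside the event-camera community, the Python sketch below implements a plain background-activity filter that keeps an event only if a neighbouring pixel fired recently. It is a generic baseline under assumed parameters (sensor size, time window), not PCLF itself: the partitioned cache-like row/column memories and the probability-based compression control are not modelled.

```python
# Minimal spatiotemporal background-activity filter for DVS events: an event is
# kept only if a neighbouring pixel fired within the last dt microseconds.
# Generic baseline under assumed parameters; PCLF's partitioned cache-like
# memories and probability-based compression control are not modelled here.
import numpy as np

H, W, dt = 128, 128, 5000                      # assumed sensor size and time window (us)
last_ts = np.full((H, W), -10**9, dtype=np.int64)   # last event timestamp per pixel

def filter_event(x, y, t):
    """Return True if the event at pixel (x, y), time t, has recent spatial support."""
    y0, y1 = max(0, y - 1), min(H, y + 2)
    x0, x1 = max(0, x - 1), min(W, x + 2)
    keep = bool((t - last_ts[y0:y1, x0:x1] <= dt).any())
    last_ts[y, x] = t                          # update the per-pixel memory either way
    return keep

events = [(10, 10, 100), (11, 10, 2000), (90, 40, 3000)]   # toy (x, y, t) stream
print([filter_event(*e) for e in events])      # -> [False, True, False]
```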
08:50 CEST TS07.5 PRACTICAL MU-MIMO DETECTION AND LDPC DECODING THROUGH DIGITAL ANNEALING
Speaker:
Po-Shao Chen, University of Michigan, US
Authors:
Po-Shao Chen, Wei Tang and Zhengya Zhang, University of Michigan, US
Abstract
Digital annealing has been successfully applied to solving combinatorial optimization (CO) problems. It is more flexible, robust, and easier to deploy on edge platforms compared to its counterparts including quantum annealing and analog and in-memory Ising machines. In this work, we apply digital annealing to compute-intensive communication digital signal processing problems, including multi-user detection in multiple-input and multiple-output (MU-MIMO) wireless communication systems and decoding low-density parity-check (LDPC) codes. We show that digital annealing can achieve near maximum likelihood (ML) accuracy for MIMO detection with even lower complexity than the conventional minimum mean square error (MMSE) detection. In LDPC decoding, we enhance digital annealing by introducing a new cost function that improves decoding accuracy and reduces computational complexity compared to the standard formulations.
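As background on how a detection problem maps onto an annealer, the sketch below runs a plain single-spin-flip Metropolis search over BPSK symbols to minimize the maximum-likelihood cost ||y − Hs||². The instance size, noise level, and cooling schedule are assumptions chosen for illustration; this is not the authors' digital annealing hardware or their LDPC cost function.

```python
# Generic digital-annealing-style sketch: single-spin-flip Metropolis search over
# BPSK symbols minimizing the ML detection cost ||y - H s||^2. Illustrates the
# problem mapping only; the instance, schedule and move rule are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_tx, n_rx = 8, 8
H = rng.normal(size=(n_rx, n_tx))              # assumed channel matrix
s_true = rng.choice([-1.0, 1.0], size=n_tx)    # transmitted BPSK symbols
y = H @ s_true + 0.1 * rng.normal(size=n_rx)   # received vector with noise

def energy(s):
    r = y - H @ s
    return float(r @ r)

s = rng.choice([-1.0, 1.0], size=n_tx)         # random initial spin configuration
T = 2.0
for sweep in range(200):
    for i in range(n_tx):
        e_old = energy(s)
        s[i] = -s[i]                           # trial flip of spin i
        dE = energy(s) - e_old
        if dE > 0 and rng.random() >= np.exp(-dE / T):
            s[i] = -s[i]                       # reject the uphill move
    T *= 0.98                                  # geometric cooling schedule

print("recovered transmitted symbols:", bool(np.array_equal(s, s_true)))
```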
08:55 CEST TS07.6 LLM-SRAF: SUB-RESOLUTION ASSIST FEATURE GENERATION USING LARGE LANGUAGE MODEL
Speaker:
Tianyi Li, ShanghaiTech University, CN
Authors:
Tianyi Li1, Zhexin Tang1, Tao Wu1, Bei Yu2, Jingyi Yu1 and Hao Geng1
1ShanghaiTech University, CN; 2The Chinese University of Hong Kong, HK
Abstract
As integrated circuit (IC) feature sizes continue to shrink, using sub-resolution assist features (SRAF) becomes increasingly crucial for improving wafer pattern resolution and fidelity. However, model-based SRAF insertion techniques, while accurate, require substantial computational resources and are often impractical for industrial scenarios. This demands more efficient and industry-compatible methods that maintain high performance. In this work, we introduce LLM-SRAF, a novel framework for SRAF generation driven by a large language model fine-tuned on an SRAF dataset. LLM-SRAF accepts semantic prompt inputs, including SRAF generation task descriptions, the OPC recipe, lithography conditions, mask rules, and sequential layout descriptions, to directly generate SRAFs. Both supervised fine-tuning and reinforcement learning with human feedback (RLHF) are employed to enable the model to acquire domain-specific knowledge and specialize in SRAF generation. Experimental results show that LLM-SRAF outperforms existing state-of-the-art methods in metrics of mask quality, including edge placement error (EPE) and process variation band (PVB) area. Moreover, LLM-SRAF also runs 3x faster than the commercial Calibre tool.
09:00 CEST TS07.7 A MULTI-STAGE POTTS MACHINE BASED ON COUPLED CMOS RING OSCILLATORS
Speaker:
Yilmaz Ege Gonul, Drexel University, US
Authors:
Yilmaz Gonul and Baris Taskin, Drexel University, US
Abstract
This work presents a multi-stage coupled ring oscillator based Potts machine, designed with phase-shifted Sub-Harmonic-Injection-Locking (SHIL) to represent multivalued Potts spins at different solution stages with oscillator phases. The proposed Potts machine is able to solve a certain class of combinatorial optimization problems that natively require multivalued spins with a divide-and-conquer approach, facilitated through the alternating phase-shifted SHILs acting on the oscillators. The proposed architecture eliminates the need for any external intermediary mappings or usage of external memory, as the influence of SHIL allows oscillators to act as both memory and computation units. Planar 4-coloring problems of sizes up to 2116 nodes are mapped to the proposed architecture. Simulations demonstrate that the proposed Potts machine provides exact solutions for smaller problems (e.g. 49 nodes) and generates solutions reaching up to 97% accuracy for larger problems (e.g. 2116 nodes).
09:05 CEST TS07.8 ADAPT-PNC: MITIGATING DEVICE VARIABILITY AND SENSOR NOISE IN PRINTED NEUROMORPHIC CIRCUITS WITH SO ADAPTIVE LEARNABLE FILTERS
Speaker:
Tara Gheshlaghi, KIT - Karlsruher Institut für Technologie, DE
Authors:
Tara Gheshlaghi1, Priyanjana Pal1, Haibin Zhao1, Michael Hefenbrock2, Michael Beigl1 and Mehdi Tahoori1
1Karlsruhe Institute of Technology, DE; 2RevoAI GmbH, DE
Abstract
The rise of the Internet of Things demands flexible, biocompatible, and cost-effective devices. Printed electronics provide a solution through low-cost and on-demand additive manufacturing on flexible substrates, making them ideal for IoT applications. However, variations in additive manufacturing processes pose challenges for reliable circuit fabrication. Adapting neuromorphic computing to printed electronics could address these issues. Printed neuromorphic circuits offer robust computational capabilities for near-sensor processing in IoT. One limitation of existing printed neuromorphic circuits is their inability to process temporal sensory inputs. To address this, integrating temporal components into printed neuromorphic circuit architectures enables the effective processing of time-series sensory data. Printed neuromorphic circuits face challenges from manufacturing variations such as ink dispersion, as well as sensor noise and temporal fluctuations, especially when processing temporal data and using time-dependent components like capacitors. To mitigate these challenges, we propose robustness-aware temporal processing neuromorphic circuits with low-pass second-order learnable filters (SO-LF). This approach integrates variation awareness by considering the variation potential of component values during training and uses data augmentation to enhance adaptability against physical and sensor data variations. Simulations on 15 benchmark time-series datasets show that our circuit effectively handles noisy temporal information under 10% process variations, achieving an average accuracy and power improvement of ≈ 24.7% and ≈ 91%, respectively, compared to models lacking variation awareness with ≈ 1.9× more devices.
09:10 CEST TS07.9 SELF-ADAPTIVE ISING MACHINES FOR CONSTRAINED OPTIMIZATION
Speaker and Author:
Corentin Delacour, University of California, Santa Barbara, US
Abstract
Ising machines (IMs) are physics-inspired alternatives to von Neumann architectures for solving hard optimization tasks. By mapping binary variables to coupled Ising spins, IMs can naturally solve unconstrained combinatorial optimization problems such as finding maximum cuts in graphs. However, despite their importance in practical applications, constrained problems remain challenging to solve for IMs that require large quadratic energy penalties to ensure the correspondence between energy ground states and constrained optimal solutions. To relax this requirement, we propose a self-adaptive IM that iteratively shapes its energy landscape using a Lagrange relaxation of constraints and avoids prior tuning of penalties. Using a probabilistic-bit (p-bit) IM emulated in software, we benchmark our algorithm with multidimensional knapsack problems (MKP) and quadratic knapsack problems (QKP), the latter being an Ising problem with linear constraints. For QKP with 300 variables, the proposed algorithm finds better solutions than state-of-the-art IMs such as Fujitsu's Digital Annealer and requires 7,500x fewer samples. Our results show that adapting the energy landscape during the search can speed up IMs for constrained optimization.
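For readers unfamiliar with Lagrange relaxation in this setting, the sketch below illustrates the general idea of adapting a multiplier from the observed constraint violation instead of fixing a large quadratic penalty up front. It is a generic subgradient illustration in Python, not the paper's p-bit algorithm: the knapsack instance, step-size schedule, and greedy inner solver are all assumptions. Because of the duality gap, the best feasible solution tracked this way is a heuristic, not necessarily the optimum.

```python
# Generic Lagrange-relaxation sketch for a knapsack-style constraint: instead of
# fixing a large quadratic penalty, a multiplier is adapted from the observed
# constraint violation (subgradient updates). Illustration only; the instance,
# step-size schedule, and greedy inner solver are assumptions.
import numpy as np

values = np.array([10.0, 7.0, 5.0, 12.0, 3.0])
weights = np.array([4.0, 3.0, 2.0, 6.0, 1.0])
capacity = 8.0

lam = 0.0                                      # Lagrange multiplier for the constraint
best_x, best_val = None, -np.inf

for step in range(200):
    # Inner step: with lam fixed, the relaxed objective separates per item,
    # so each item is taken iff its adjusted profit is positive.
    x = (values - lam * weights > 0).astype(float)
    used = float(weights @ x)
    if used <= capacity and values @ x > best_val:
        best_x, best_val = x.copy(), float(values @ x)   # best feasible found so far
    # Subgradient update with a diminishing step: raise lam on violation,
    # lower it (but keep it non-negative) when there is slack.
    lam = max(0.0, lam + (1.0 / (step + 1)) * (used - capacity))

print("best feasible selection:", best_x, "value:", best_val)
```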
09:15 CEST TS07.10 ENABLING SNN-BASED NEAR-MEA NEURAL DECODING WITH CHANNEL SELECTION: AN OPEN-HW APPROACH
Speaker:
Gianluca Leone, Università degli Studi di Cagliari, IT
Authors:
Gianluca Leone, Luca Martis, Luigi Raffo and Paolo Meloni, Università degli Studi di Cagliari, IT
Abstract
Advancements in CMOS microelectrode array sensors have significantly improved sensing area and resolution, paving the way to accurate Brain-Machine Interfaces (BMIs). However, near-sensor neural decoding on implantable computing devices is still an open problem. A promising solution is provided by Spiking Neural Networks (SNNs), which leverage event sparsity to improve energy consumption. However, given the typical data rates involved, the workload related to I/O acquisition and spike encoding is dominant and limits the benefits achievable with event-based processing. In this work, we present two power-efficient implementations, on FPGA and ASIC, of a dedicated processor for the decoding of intracortical action potentials from primary motor cortex. The processor leverages lightweight sparse SNNs to achieve state-of-the-art accuracy. To limit the impact of I/O transfers on energy efficiency, we introduced a channel selection scheme that reduced bandwidth requirements by 3x and power consumption by 2.3x and 1.6x on the FPGA and ASIC, respectively, enabling inference at 0.446 µJ and 1.04 µJ, with no significant loss in accuracy. To promote broad adoption in a specialized, research-intensive domain, we have based our implementations on open-source EDA tools, low-cost hardware, and an open PDK.
09:20 CEST TS07.11 TOWARDS FAST AUTOMATIC DESIGN OF SILICON DANGLING BOND LOGIC
Speaker:
Jan Drewniok, TU Munich, DE
Authors:
Jan Drewniok1, Marcel Walter1, Samuel Ng2, Konrad Walus2 and Robert Wille1
1TU Munich, DE; 2University of British Columbia, CA
Abstract
In recent years, Silicon Dangling Bond (SiDB) logic has emerged as a promising beyond-CMOS technology. Unlike conventional circuit technology, where logic is realized through transistors, SiDB logic utilizes quantum dots with variable charge states. By strategically arranging these dots, logic functions can be constructed. However, determining such arrangements is a tremendously complex task, which makes automatically obtaining SiDB logic implementations inefficient. To address this challenge, we propose an idea to speed up the design process by utilizing dedicated search space pruning strategies. Initial results show that the combined pruning techniques yield 1) a drastic reduction of the search space, and 2) a corresponding reduction in runtime by up to a factor of 33.
09:21 CEST TS07.12 LOADING-AWARE MIXING-EFFICIENT SAMPLE PREPARATION ON PROGRAMMABLE MICROFLUIDIC DEVICE
Speaker:
Debraj Kundu, TU Munich, DE
Authors:
Debraj Kundu1, Tsun-Ming Tseng2, Shigeru Yamashita3 and Ulf Schlichtmann2
1TU Munich (TUM), DE; 2TU Munich, DE; 3Ritsumeikan University, JP
Abstract
Sample preparation, where a certain number of reagents must be mixed in a specific volumetric ratio, is an integral step for various bio-assays. A programmable microfluidic device (PMD) is an advanced flow-based microfluidic biochip (FMB) platform that is considered to be very effective for sample preparation. However, the impact of mixer placement, reagent distribution, and mixing time on the automation of sample preparation has not yet been investigated. We consider a mixing efficiency model controlled by the number of alternations "μ" of reagents along the mixing circulation path and propose a loading-aware placement strategy that maximizes the mixing efficiency. We use satisfiability modulo theories (SMT) and propose a one-pass strategy for placing the mixers and the reagents that successfully enhances the loading and mixing efficiencies.

W02 Heterogeneous Integration: from advanced 3D technology to innovative computing architectures

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 12:30 CEST


ASD05 ASD focus session: Teleoperation as a Step Towards Fully Autonomous Systems

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Organisers:
Frank Diermeyer, TU Munich, DE
Rolf Ernst, TU Braunschweig, DE

In the foreseeable future, highly automated mobile systems, such as vehicles, robots, UAVs, or trains, will be confronted with difficult situations that require external support. The availability of such external support corresponds to level 4 driving automation and is an essential feature in current robotaxis and automated public transportation. While the first generation of level 4 prototypes relied on safety driver support, commercial systems are gradually moving towards support by teleoperation. Designing teleoperation support for level 4 systems is an end-to-end problem involving two main research and practical challenges: the teleoperation function, which defines the remote human interface with its scene representation and available control functions, and the real-time communication channel, which involves wired and wireless segments and must provide reliable end-to-end data transport.

Time Label Presentation Title
Authors
11:00 CEST ASD05.1 AUTOMATED VEHICLE TELEOPERATION – VISION AND CHALLENGES.
Presenter:
Frank Diermeyer, TU Munich, DE
Author:
Frank Diermeyer, TU Munich, DE
Abstract
.
11:20 CEST ASD05.3 RELIABLE REAL-TIME COMMUNICATION FOR TELEOPERATION
Presenter:
Selma Saidi, Technische Universität Braunschweig, DE
Author:
Selma Saidi, Technische Universität Braunschweig, DE
Abstract
.
11:30 CEST ASD05.4 PANEL DISCUSSION
Presenter:
All the Panelists, DATE 2025, FR
Author:
All the Panelists, DATE 2025, FR
Abstract
.

FS06 Focus Session: Improving Chip Design Enablement for Universities in Europe

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Session chair:
Ulf Schlichtmann, TU Munich, DE

Session co-chair:
Holger Blume, Leibniz University Hannover, DE

Organisers:
Norbert Wehn, University of Kaiserslautern-Landau, DE
Lukas Krupp, University of Kaiserslautern-Landau, DE

Time Label Presentation Title
Authors
11:00 CEST FS06.1 PANEL: IMPROVING CHIP DESIGN ENABLEMENT FOR UNIVERSITIES IN EUROPE
Speaker:
Norbert Wehn, RPTU University of Kaiserslautern-Landau, DE
Authors:
X. Sharon Hu1, Joachim Rodrigues2, Luca Benini3, Ian O'Connor4, Andreas Brüning5 and Patrick Haspel6
1University of Notre Dame, US; 2Lund University, SE; 3ETH Zurich, CH | Università di Bologna, IT; 4Lyon Institute of Nanotechnology, FR; 5FMD, DE; 6Synopsys, DE
Abstract
The semiconductor industry is central to the European economy, particularly in the industrial and automotive sectors. Semiconductor fabrication and chip design are the two largest segments of the microelectronics value chain. While Europe is strengthening semiconductor fabrication and technology with considerable investments, e.g., in new fabs, chip design capabilities fall far short of the required capacities. The EU MicroElectronics Training, Industry and Skills (METIS) Report 2023 has shown that chip designers are the job profiles identified as the most difficult to find in the European microelectronics industry. European universities face many challenges hindering their ability to produce skilled graduates and contribute to the semiconductor ecosystem. While student interest in, e.g., AI is booming, we observe a decreasing interest in microelectronics. The main reasons for this are the high entry barriers for students, reinforced by the lack of chip design enablement in academia. Hence, there are ongoing initiatives in different European countries, on the EU level, and worldwide to strengthen chip design education and research. This focus session will bring together stakeholders of these initiatives from Europe and the USA to explore the critical challenges, opportunities, and potential strategies facing chip design enablement in European academic institutions. The session will be held in the panel format with active audience participation to guarantee inclusiveness and foster a broad view of the topic.

MPP01 Driving KDT JU Initiative towards the Chips Act Multi-Partner Projects

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST MPP01.1 MULTI-PARTNER PROJECT: A MODEL-DRIVEN ENGINEERING FRAMEWORK FOR FEDERATED DIGITAL TWINS OF INDUSTRIAL SYSTEMS (MATISSE)
Speaker:
Djamel Eddine Khelladi, CNRS, University of Rennes, FR
Authors:
Alessio Bucaioni1, Romina Eramo2, Luca Berardinelli3, Hugo Bruneliere4, Benoit Combemale5, Djamel Khelladi5, Vittoriano Muttillo2, Andrey Sadovykh6 and Manuel Wimmer3
1Mälardalen University, SE; 2University of Teramo, IT; 3JKU, AT; 4IMT Atlantique, FR; 5IRISA, FR; 6Softeam, FR
Abstract
Digital twins are virtual representations of real-world entities or systems. Their primary goal is to help organizations understand and predict the behaviour and properties of these entities or systems. Additionally, digital twins enhance activities such as monitoring, verification, validation, and testing. However, the inherent complexity of digital twins implies challenges throughout the systems engineering process. This notably includes design, development, and analysis phases, as well as deployment, execution, and maintenance. Moreover, existing approaches, methods, techniques, and tools for modelling, simulating, validating, and monitoring single digital twins must now address the increased complexity in federation scenarios. These scenarios introduce new challenges, such as digital twin identification, shared metadata, cross-digital twin communication and synchronization, and federation governance. The KDT Joint Undertaking MATISSE project tackles these challenges by aiming to provide a model-driven framework for the continuous engineering of federated digital twins. It leverages model-driven engineering techniques and practices as the core enabling technology, with traceability serving as an essential infrastructural service for the digital twins federation. In this paper, we introduce the MATISSE conceptual framework for digital twins, highlighting both the novelty of the project's concept and its technical objectives. As the project is still in its initial phase, we identify key research challenges relevant to the DATE community and propose a preliminary research roadmap. This roadmap addresses traceability and federation mechanisms, the required continuous engineering strategy, and the development of digital twin-based services for verification, validation, prediction, and monitoring. To illustrate our approach, we present two concrete scenarios that demonstrate practical applications of the MATISSE conceptual framework.
11:05 CEST MPP01.2 MULTI-PARTNER PROJECT: ELECTRIC VEHICLE DATA ACQUISITION AND VALORISATION: A PERSPECTIVE FROM THE OPEVA PROJECT
Speaker:
Gianluigi Ferrari, University of Parma, IT
Authors:
Alper Kanak1, Salih Ergün2, İbrahim Arif3, Ali Serdar Atalay4, Serhat Ege İnanç4, Oguzhan Herkiloğlu5, Ahmet Yazıcı6, Yunus Sabri Kirca6, Muhammed Ozberk7, Kerem Sarı7, Ali Kafalı7, Dilara Bayar7, Muhammed Oğuz Taş8, Luca Davoli9, Laura Belli9, Gianluigi Ferrari9, Badar Muneer10, Valentina Palazzi9, Luca Roselli10 and Fabio Gelati11
1Ergünler R&D Co.Ltd., TR; 2Ergünler R&D Co. Ltd., TR; 3Ergtech SP.Z.O.O., PL; 4AI4SEC OÖ, EE; 5Bitnet Bilişim Hizmetleri Ltd., TR; 6Eskişehir Osmangazi University, TR; 7ACD Data Engineering, TR; 8INO Robotics, TR; 9University of Parma, IT; 10University of Perugia, IT; 11Luna Geber Engineering s.r.l., IT
Abstract
The OPtimization of Electric Vehicle Autonomy (OPEVA) project enhances data aggregation for Electric Vehicles (EVs) by collecting critical real-time data (i.e., vehicle performance, battery health, charging behaviours) through heterogeneous data acquisition devices built on robust HW and integrated with Internet of Things (IoT) protocols. By combining internal sensor data and driver-specific behaviours with external information (e.g., road conditions, charging station availability), OPEVA maximizes vehicle performance, establishing secure and seamless data communication between EVs and the infrastructure, and using IoT and cloud computing tools alongside Vehicle-to-Everything (V2X) devices and networks. This paper focuses on the extensible data model ensuring semantic data integrity considering in- and out-vehicle factors, presenting data acquisition solutions that deal with OPEVA's semantic data model and their use in various Artificial Intelligence (AI)-powered use cases (e.g., range prediction, route optimization, battery management).
11:10 CEST MPP01.3 MULTI-PARTNER PROJECT: A DEEP LEARNING PLATFORM TARGETING EMBEDDED HARDWARE FOR EDGE-AI APPLICATIONS (NEUROKIT2E)
Speaker:
Rajendra Bishnoi, TU Delft, NL
Authors:
Rajendra Bishnoi1, Mohammad Yaldagard1, Kanishkan Vadivel2, Manolis Sifalakis2, Nicolas Rodriguez3, Pedro Julian4, Lothar Ratschbacher3, Maen Malla5, Yogesh Pati5, Rashid Ali5 and Fabian Chersi6
1TU Delft, NL; 2IMEC Netherlands, NL; 3Silicon Austria Labs, AT; 4Universidad Nacional del Sur IIIE-DIEC, AR; 5Fraunhofer IIS, DE; 6CEA, FR
Abstract
The goal of the NEUROKIT2E project (EU HORIZON-JU-RIA) is to create an open-source Deep Learning framework for edge and embedded AI, built around an established European value chain. This framework supports a wide range of application areas that operate independently and serve a global user community. It provides easy and fast full-stack solutions, from AI application development to Neural Network design and optimization, all the way down to hardware implementations, while enabling code generation for application-specific targets. This platform provides flexibility for academic users in the AI domain to explore and innovate while allowing them the possibility to prototype systems, ensuring their work aligns well with industrial needs. This paper presents the results and achievements of the first part of this three-year project, along with its roadmap and expected outcomes.
11:15 CEST MPP01.4 MULTI-PARTNER PROJECT: SPORTS PERFORMANCE AND HEALTH ASSESSMENT IN THE DISTRIMUSE PROJECT
Speaker:
Gianluigi Ferrari, University of Parma, IT
Authors:
Luca Davoli1, Laura Belli1, Veronica Mattioli1, Gianluigi Ferrari1, Lorenzo Priano2, Jaromir Hubalek3, Lukáš Smital3, Andrea Němcová3, Daniela Chlíbková3, Vlastimil Benes4 and Johan Plomp5
1University of Parma, IT; 2University of Turin, IT; 3Brno University of Technology, CZ; 4IMA s.r.o., CZ; 5VTT Oy, FI
Abstract
In our increasingly tech-saturated world, from mobile apps and health sensors to autonomous cars and factory robots, we expect these devices to seamlessly integrate into our lives, enhancing safety and convenience. However, as these devices proliferate and their autonomy grows, ensuring they provide unobtrusive, yet effective support becomes crucial. The Horizon Europe KST multi-partner project "Distributed Multi-Sensor Systems for Human Safety and Health" (DistriMuSe) intends to support human health and safety through improved sensing of human presence, behaviour, and vital signs in a collaborative or common environment by means of multi-sensor systems, distributed processing and Machine/Deep Learning (ML/DL) techniques. In this paper, we focus on DistriMuSe's approach to sports performance and health assessment, in particular monitoring the physical activity of non-professional and hobby athletes, people who like sports and care about their health, elderly healthy people, and subjects affected by neurological disability (e.g., Parkinson's disease). The overall goal is to measure activity and exertion, estimating performance levels and determining maximum effort. We discuss the overall system-of-systems architecture, focusing on the adopted technologies.
11:20 CEST MPP01.5 MULTI-PARTNER PROJECT: ADVANCING THE EDA TOOLS LANDSCAPE FOR THE EUROPEAN RISC-V ECOSYSTEM IN TRISTAN
Speaker:
Bernhard Fischer, Siemens, AT
Authors:
Fatma Jebali1, Caaliph Andriamisaina2, Mathieu Jan2, Wolfgang Ecker3, Florian Egert4, Bernhard Fischer4, Alessio Burrello5, Daniele Jahier Pagliari5, Sara Vinco5, Giuseppe Tagliavini6, Ingo Feldner7, Andreas Mauderer7, Axel Sauer7, Arnór Kristmundsson8, Alexander Schober8, Téo Bernier9, Matti Käyrä10, Ulf Schlichtmann11 and Rocco Jonack12
1CEA LIST, FR; 2CEA-List, FR; 3Infineon Technologies, DE; 4Siemens, AT; 5Politecnico di Torino, IT; 6Università di Bologna, IT; 7Robert Bosch GmbH, DE; 8Codasip, DE; 9Thales Research & Technology, FR; 10Tampere University, FI; 11TU Munich, DE; 12MINRES Technologies GmbH, DE
Abstract
The TRISTAN project aims to expand and industrialize the European RISC-V ecosystem to compete effectively with existing commercial alternatives. This initiative specifically targets the critical challenges in the development of Electronic Design Automation (EDA) tools, essential for RISC-V-based solutions, by leveraging the synergy between the open source community and industrial solutions. This paper presents an overview of the current landscape of TRISTAN's EDA flow, highlighting specific tools and methodologies that streamline the early design phases of RISC-V-based systems. We explore the unique features of these tools, emphasizing how they complement each other to strengthen the overall design process.
11:25 CEST MPP01.6 MULTI-PARTNER PROJECT: ENABLING DIGITAL TECHNOLOGIES FOR HOLISTIC HEALTH-LIFESTYLE MOTIVATIONAL AND ASSISTED SUPERVISION SUPPORTED BY ARTIFICIAL INTELLIGENCE (H2TRAIN)
Speaker:
Juan Antonio Montiel Nelson, Institute for Applied Microelectronics, University of Las Palmas de Gran Canaria, Las Palmas de G.C., ES
Authors:
Juan Antonio Nelson1, Marco Ottella2 and Paolo Azzoni3
1Universidad de Las Palmas, ES; 2Xtremion, IT; 3INSIDE Industry Association, NL
Abstract
H2TRAIN aligns with the ECS Strategic Research and Innovation Agenda 2023 (ECS-SRIA), addressing key challenges in integrating digital technologies for health-focused lifestyles through AI-enhanced networks. This project pioneers the use of graphene to develop autonomous biosensors within CMOS technology, supporting advancements in AI-powered health services and IoT applications, covering the entire edge-to-cloud continuum. Beyond digital integration, H2TRAIN innovates in energy detection, collection, and storage, essential for embedding health and sports functions in IoT wearables through smart textile and system integration. The solutions will be rigorously tested and validated with insights from medical, sports, social sciences, and end-user feedback. Focused on remote assisted living, amateur sports training, and post-operative monitoring, H2TRAIN aims to drive innovation in the smart healthcare sector, where investment in semiconductor nanofabrication is limited by the small scale of medical applications.
11:30 CEST MPP01.7 MULTI-PARTNER PROJECT: DRIVING THE VEHICLE OF THE FUTURE: HOW FEDERATE AND HAL4SDV ARE SHAPING EUROPE'S SOFTWARE-DEFINED VEHICLE ECOSYSTEM
Speaker:
Michael Paulweber, AVL, AT
Authors:
Michael Paulweber1, Andreas Eckel2 and Paolo Azzoni3
1AVL-Instrumentation and test systems, AT; 2TTTech Computertechnik AG, DE; 3Inside Industry Association, NL
Abstract
The FEDERATE and HAL4SDV projects aim to address the growing importance of software in the automotive industry, positioning Europe as a leader in the software-defined vehicle (SDV) domain. FEDERATE focuses on building a cohesive European SDV ecosystem by coordinating stakeholders such as OEMs, semiconductor companies, and research institutions. It supports the agile development of non-differentiating software through open-source collaboration, fostering a vibrant SDV community and providing guidance for ongoing and future SDV projects. Meanwhile, HAL4SDV aligns with the EU's Strategic Research and Innovation Agenda to develop technologies and processes needed for SDV advancement beyond 2030. HAL4SDV's objectives include creating a unified software interface, hardware abstraction, and Over-The-Air updates, while focusing on cybersecurity, real-time capabilities, and seamless integration with smart city infrastructure. Together, these projects aim to drive innovation, scalability, and sustainability in the SDV space.
11:31 CEST MPP01.8 MULTI-PARTNER PROJECT: ARTIFICIAL INTELLIGENCE IN MANUFACTURING LEADING TO SUSTAINABILITY AND THE CONSIDERATION OF HUMAN ASPECTS (AIMS5.0)
Speaker:
Anouar Nechi, University of Lübeck, DE
Authors:
Anouar Nechi1, Yasin Ghafourian2, Belal Abu Naim2, Thomas Gutt3, Georgios Dimitrakopoulos4, Amira Moualhi1, Mladen Berekovic1, Pal Varga5 and Markus Tauber2
1University of Lübeck, DE; 2Research Studios Austria, AT; 3Infineon Technologies, DE; 4Harokopio University of Athens, GR; 5Budapest University of Technology and Economics, HU
Abstract
The industrial landscape is undergoing a transformative shift towards Industry 5.0, a paradigm characterized by the convergence of sustainability, digital autonomy, and human-centric design. This article focuses on the adoption, enhancement, and implementation of AI-driven hardware, tools, methodologies, and semiconductor technologies in this progression. We present here a comprehensive strategy from the AIMS5.0 project with the objective of connecting academic developments with practical industrial use, fostering a harmonious relationship between humans and machines to improve efficiency, spur innovation, and enhance adaptability. Hence we show here our global vision, and examples of how the creation of AI-based industrial solutions is supported by novel AI-tool chains, advancements in hardware, and tools supporting human aspects.

SD02 Special Day on AI and ML Trends

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

This Special Day focuses on exploring the latest trends and innovations in Artificial Intelligence (AI) and Machine Learning (ML) in the context of DATE. As AI (and mainly generative AI) is booming, especially since the release of ChatGPT, we expect AI/ML to change the way we approach Design, Automation, and Test. In this context, field experts will present their thoughts on the challenges and opportunities of AI/ML, and will engage the audience in an open discussion about the trends that the DATE community should pursue.

This Special Day will highlight the following topics:
* Design of hardware architectures and software, including automatic exploration of large design spaces, assistance of the human designer, resource selection and optimization
* Verification of hardware architectures, with topics such as performance prediction, (formal) design validation, accelerating simulations thanks to AI-Augmented Surrogate Models
* AI-Accelerated Physical Design and Validation of layout and floorplans
* New AI accelerator architectures

These topics will be addressed by a lineup of six distinguished speakers, experts in their respective fields. The day will conclude with a panel discussion allowing experts and the audience to engage in an informal exchange of ideas and trigger discussions on the future research directions and/or the interaction between the various domains presented during the day.

This Special Day is the ideal event for AI/ML researchers, data scientists, hardware designers, software developers, sustainability advocates, and anyone interested in the future directions of AI and ML for Design, Automation and Test.

Time Label Presentation Title
Authors
11:00 CEST SD02.1 TBD
Presenter:
Siddharth Garg, New York University, US
Author:
Siddharth Garg, New York University, US
Abstract
.
11:25 CEST SD02.2 TBD
Presenter:
Orlando Moreira, Snapchat, US
Author:
Orlando Moreira, Snapchat, US
Abstract
.
11:50 CEST SD02.3 TBD
Presenter:
Wolfgang Ecker, Infineon Technologies, DE
Author:
Wolfgang Ecker, Infineon Technologies, DE
Abstract
.
12:15 CEST SD02.4 ROUND TABLE WITH THE SPEAKERS, DISCUSSION WITH THE AUDIENCE
Presenter:
All the Panelists, DATE, FR
Author:
All the Panelists, DATE, FR
Abstract
.

TS08 Design Methodologies and Applications for Machine Learning

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS08.1 FILTER-BASED ADAPTIVE MODEL PRUNING FOR EFFICIENT INCREMENTAL LEARNING ON EDGE DEVICES
Speaker:
Jing-Jia Hung, National Taiwan University & TSMC, TW
Authors:
Jing-Jia Hung1, Yi-Jung Chen2, Hsiang-Yun Cheng3, Hsu Kao4 and Chia-Lin Yang1
1National Taiwan University, TW; 2Department of Computer Science and Information Engineering, National Chi Nan University, TW; 3Academia Sinica, TW | National Taiwan University, TW; 4National Tsing Hua University, TW
Abstract
Incremental Learning (IL) enhances Machine Learning (ML) models over time with new data, ideal for edge devices at the forefront of data collection. However, executing IL on edge devices faces challenges due to limited resources. Common methods involve IL followed by model pruning, or specialized IL methods for edges. However, the former increases training time due to fine-tuning and compromises accuracy for past classes due to limited retained samples or features. Meanwhile, existing edge-specific IL methods utilize weight pruning, which requires specialized hardware or compilers for speedup and cannot reduce computations on general embedded platforms. In this paper, we propose Filter-based Adaptive Model Pruning (FAMP), the first pruning method designed specifically for IL. FAMP prunes the model before the IL process, allowing fine-tuning to occur concurrently with IL, thereby avoiding extended training time. To maintain high accuracy for both new and past data classes, FAMP adapts the compressed model based on observed data classes and retains filter settings from the previous IL iteration to mitigate forgetting. Across all tests, FAMP achieves the best average accuracy, with only a 2.78% accuracy drop relative to full ML models with IL. Moreover, unlike the common methods that prolong training time, FAMP requires 35% less training time on average than using the full ML models for IL.
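For context, the snippet below shows the simplest form of filter-level pruning that the abstract builds on: scoring each convolutional output filter by its L1 norm and keeping the strongest ones. The keep ratio and scoring criterion are illustrative assumptions; FAMP's adaptive, IL-aware pruning and filter retention are not reproduced here.

```python
# Sketch of L1-norm filter pruning for a convolutional layer, as a generic
# illustration of "pruning before incremental learning". The keep ratio and
# criterion are assumptions; FAMP's adaptive, class-aware pruning is not shown.
import numpy as np

rng = np.random.default_rng(0)
conv_w = rng.normal(size=(64, 32, 3, 3))        # (out_filters, in_channels, k, k)

def prune_filters(w, keep_ratio=0.5):
    scores = np.abs(w).sum(axis=(1, 2, 3))      # L1 norm of each output filter
    n_keep = int(len(scores) * keep_ratio)
    keep = np.sort(np.argsort(scores)[-n_keep:])  # indices of the strongest filters
    return w[keep], keep

pruned_w, kept_idx = prune_filters(conv_w)
print(conv_w.shape, "->", pruned_w.shape)       # (64, 32, 3, 3) -> (32, 32, 3, 3)
```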
11:05 CEST TS08.2 DYLGNN: EFFICIENT LM-GNN FINE-TUNING WITH DYNAMIC NODE PARTITIONING, LOW-DEGREE SPARSITY, AND ASYNCHRONOUS SUB-BATCH
Speaker:
Zhen Yu, Shanghai Jiao Tong University, CN
Authors:
Zhen Yu, Jinhao Li, Jiaming Xu, Shan Huang, Jiancai Ye, Ningyi Xu and Guohao Dai, Shanghai Jiao Tong University, CN
Abstract
Text-Attributed Graphs (TAGs) tasks involve both textual node information and graph topological structure. The top-k method, using Language Models (LMs) for text encoding and Graph Neural Networks (GNNs) for graph processing, offers the best accuracy while balancing memory and training time. However, challenges still exist: (1) Static sampling of k neighbors reduces performance. Using a fixed k can result in sampling too few or too many nodes, leading to a 3.2% accuracy loss across datasets. (2) Time-consuming processing for non-trainable nodes. After partitioning all nodes into with-gradient trainable and without-gradient non-trainable sets, the number of non-trainable nodes is ∼9-10× larger than the number of trainable nodes, accounting for nearly 70% of the total time. (3) Time-consuming data movement. For processing non-trainable nodes, after the text strings are tokenized into tokens on the CPU side, the data movement from host memory to GPU takes 30%-40% of the time. In this paper, we propose DyLGNN, an efficient end-to-end LM-GNN fine-tuning framework built on three innovations: (1) Heuristic Node Partitioning. We propose an algorithm that dynamically and adaptively selects "important" nodes to participate in the training process for downstream tasks. Compared to the static top-k method, we reduce the training memory usage by 24.0%. (2) Low-Degree Sparse Attention. We point out that the embedding of low-degree nodes has minimal impact on the final results (e.g., ∼1.5% accuracy loss); therefore, we perform sparse attention computation on low-degree nodes to further reduce the computation caused by "unimportant" nodes, achieving an average 1.27× speedup. (3) Asynchronous Sub-batch Pipeline. Within the top-k framework, we analyze the time breakdown of the LM inference component. Leveraging our heuristic node partitioning, which effectively minimizes memory demands, we can asynchronously execute data movement and computation, thereby overlapping the time required for data movement. This improves GPU utilization and results in an average 1.1× speedup. We conduct experiments on several common graph datasets, and by combining the three methods mentioned above, DyLGNN achieves a 22.0% reduction in memory usage and a 1.3× end-to-end speedup compared to the top-k strategy.
11:10 CEST TS08.3 ITERL2NORM: FAST ITERATIVE L2-NORMALIZATION
Speaker:
ChangMin Ye, Hanyang University, KR
Authors:
ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin and Doo Seok Jeong, Hanyang University, KR
Abstract
Transformer-based large language models are memory-bound models whose operation is based on a large amount of data that is only marginally reused. Thus, the data movement between a host and accelerator likely dictates the total wall-clock time. Layer normalization is one of the key workloads in the transformer model, following each of the multi-head attention and feed-forward network blocks. To reduce data movement, layer normalization needs to be performed on the same chip as the matrix-matrix multiplication engine. To this end, we introduce an iterative L2-normalization method for 1D input (IterL2Norm), ensuring fast convergence to the steady-state solution within five iteration steps and high precision, outperforming the fast inverse square root algorithm in six out of nine cases for FP32 and five out of nine for BFloat16 across the embedding lengths used in the OPT models. Implemented in 32/28nm CMOS, the IterL2Norm macro normalizes d-dimensional vectors, where 64 ≤ d ≤ 1024, with a latency of 116-227 cycles at 100MHz/1.05V.
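To illustrate what an iterative L2-normalization can look like numerically, the sketch below refines an inverse-square-root estimate with a fixed number of Newton-Raphson steps and then scales the input vector. This is a generic textbook iteration under an assumed exponent-based initial guess, not the IterL2Norm datapath described in the paper.

```python
# Generic sketch: L2-normalization via an iterative inverse square root refined
# with Newton-Raphson steps. Not the paper's IterL2Norm datapath; the
# exponent-based initial guess and the fixed iteration count are assumptions.
import math
import numpy as np

def iter_l2_normalize(x, iters=5):
    s = float(np.dot(x, x))              # sum of squares of the input vector
    _, e = math.frexp(s)                 # s = m * 2**e with m in [0.5, 1)
    y = 2.0 ** (-e / 2.0)                # cheap initial guess, within ~2x of 1/sqrt(s)
    for _ in range(iters):
        y = y * (1.5 - 0.5 * s * y * y)  # Newton-Raphson step converging to 1/sqrt(s)
    return x * y                         # approximately x / ||x||_2

v = np.array([3.0, 4.0])
print(iter_l2_normalize(v))              # ~ [0.6, 0.8]
```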
11:15 CEST TS08.4 MPTORCH-FPGA: A CUSTOM MIXED PRECISION FRAMEWORK FOR FPGA-BASED DNN TRAINING
Speaker:
Sami BEN ALI, Inria Rennes, FR
Authors:
Sami BEN ALI1, Silviu-Ioan Filip1, Olivier Sentieys1 and Guy Lemieux2
1INRIA, FR; 2University of British Columbia, CA
Abstract
Training Deep Neural Networks (DNNs) is computationally demanding, leading to a growing interest in reduced precision formats to enhance hardware efficiency. Several frameworks explore custom number formats with parameterizable precision through software emulation on CPUs or GPUs. However, they lack comprehensive support for different rounding modes and struggle to accurately evaluate the impact of custom precision for FPGA-based targets. This paper introduces MPTorch-FPGA, an extension of the MPTorch framework for performing custom, multi-precision inference and training computations in CPU, GPU, and FPGA environments in PyTorch. MPTorch-FPGA can generate a model-specific accelerator for DNN training, with customizable sizes and arithmetic implementations, providing bit-level accuracy with respect to emulated low precision DNN training on GPUs or CPUs. An offline matching algorithm selects one of several pre-generated (static) FPGA configurations using a custom performance model to estimate latency. To showcase the versatility of MPTorch-FPGA, we present a series of training benchmarks using diverse DNN models, exploring a range of number format configurations and rounding modes. We report both accuracy and hardware performance metrics, verifying the precision of our performance model by comparing estimated and measured latencies across multiple benchmarks. These results highlight the flexibility and practical value of our framework.
11:20 CEST TS08.5 MEMHD: MEMORY-EFFICIENT MULTI-CENTROID HYPERDIMENSIONAL COMPUTING FOR FULLY-UTILIZED IN-MEMORY COMPUTING ARCHITECTURES
Speaker:
Do Yeong Kang, Sungkyunkwan University, KR
Authors:
Do Yeong Kang, Yeong Hwan Oh, Chanwook Hwang, Jinhee Kim, Kang Eun Jeon and Jong Hwan Ko, Sungkyunkwan University, KR
Abstract
Hyperdimensional Computing (HDC) has shown great potential in brain-inspired computing, but its integration with In-Memory Computing (IMC) faces challenges due to high-dimensional vector operations and memory utilization issues. This paper introduces a novel multi-centroid Associative Memory (AM) structure for HDC implemented on IMC architectures, addressing these challenges while maintaining high accuracy in classification tasks. Our approach compresses dimensions through the multi-centroid model, bringing IMC array utilization for Associative Search close to 100% and significantly reducing computations. This dimension compression substantially decreases memory footprint in both the Encoding Module and Associative Memory, while reducing computational requirements. Additionally, we propose innovative initialization and learning methods for multi-centroid AM, including clustering-based initialization for faster convergence and a quantization-aware iterative learning approach for high-accuracy, IMC-compatible AM training. Our adaptive structure optimizes model design based on available hardware resources by adjusting memory columns and rows. Comprehensive evaluations across various classification datasets demonstrate that our method achieves superior memory efficiency at equivalent accuracy levels and improved accuracy at equivalent memory usage compared to conventional HDC models.
11:25 CEST TS08.6 ODIN: LEARNING TO OPTIMIZE OPERATION UNIT CONFIGURATION FOR ENERGY-EFFICIENT DNN INFERENCING
Speaker:
Gaurav Narang, Washington State University, US
Authors:
Gaurav Narang, Jana Doppa and Partha Pratim Pande, Washington State University, US
Abstract
ReRAM-based Processing-In-Memory (PIM) architectures enable energy-efficient Deep Neural Network (DNN) inferencing. However, ReRAM crossbars suffer from various non-idealities that affect overall inferencing accuracy. To address that, the matrix-vector-multiplication (MVM) operations are computed by activating a subset of the full crossbar, referred to as Operation Unit (OU). However, OU configurations vary with the neural layers' features such as sparsity, kernel size and their impact on predictive accuracy. In this paper, we consider the problem of learning appropriate layer-wise OU configurations in ReRAM crossbars for unseen DNNs at runtime such that performance is maximized without loss in predictive accuracy. We employ a machine learning (ML) based framework called Odin, which selects the OU sizes for different neural layers as a function of the neural layer features and time-dependent ReRAM conductance drift. Our experimental results demonstrate that the energy-delay-product (EDP) is reduced by up to 8.7× over state-of-the-art homogeneous OU configurations without compromising predictive accuracy.
11:30 CEST TS08.7 SLIPSTREAM: SEMANTIC-BASED TRAINING ACCELERATION FOR RECOMMENDATION MODELS
Speaker:
Yassaman Ebrahimzadeh Maboud, University of British Columbia, CA
Authors:
Yassaman Ebrahimzadeh Maboud1, Muhammad Adnan1, Divya Mahajan2 and Prashant Jayaprakash Nair1
1University of British Columbia, CA; 2Georgia Tech, US
Abstract
Recommendation models play a crucial role in delivering accurate and tailored user experiences. However, training such models poses significant challenges regarding resource utilization and performance. Prior research has proposed an approach that categorizes embeddings into popular and non-popular classes to reduce the training time for recommendation models. We observe that, even among the popular embeddings, certain embeddings undergo rapid training and exhibit minimal subsequent variation, resulting in saturation. Consequently, updates to these embeddings become redundant, lacking any contribution to model quality. This paper presents Slipstream, a software framework that identifies stale embeddings on the fly and skips their updates to enhance performance. Our experiments demonstrate Slipstream's ability to maintain accuracy while effectively discarding updates to non-varying embeddings. This capability enables Slipstream to achieve substantial speedup, optimize CPU-GPU bandwidth usage, and eliminate unnecessary memory access. Slipstream showcases training time reductions of 2x, 2.4x, 1.2x, and 1.175x across real-world datasets and configurations, compared to Baseline XDL, Intel-optimized DRLM, FAE, and Hotline, respectively.
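The sketch below illustrates the general idea of skipping updates to saturated embedding rows: rows whose recent updates fall below a threshold are frozen and excluded from further writes. The threshold, the plain SGD update, and the freezing rule are assumptions for illustration, not Slipstream's actual staleness detection or training pipeline.

```python
# Minimal sketch of skipping updates for "stale" embedding rows: rows whose
# updates become tiny are frozen. Illustration only; the threshold, window and
# SGD update are assumptions, not Slipstream's staleness criterion or pipeline.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 16))          # embedding table: 1000 rows, dim 16
frozen = np.zeros(1000, dtype=bool)        # rows whose updates we skip
lr, tau = 0.05, 1e-3                       # learning rate, staleness threshold

def sparse_update(rows, grads):
    """Apply SGD only to rows that are not frozen; freeze rows that barely move."""
    for r, g in zip(rows, grads):
        if frozen[r]:
            continue                        # skip the redundant update entirely
        delta = lr * g
        emb[r] -= delta
        if np.linalg.norm(delta) < tau:     # update is tiny -> row has saturated
            frozen[r] = True

# Toy usage: a "batch" touching a few embedding rows with near-zero gradients.
rows = np.array([3, 42, 7])
grads = rng.normal(scale=1e-4, size=(3, 16))
sparse_update(rows, grads)
print(frozen[[3, 42, 7]])                   # these rows are now frozen
```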
11:35 CEST TS08.8 COMPASS: A COMPILER FRAMEWORK FOR RESOURCE-CONSTRAINED CROSSBAR-ARRAY BASED IN-MEMORY DEEP LEARNING ACCELERATORS
Speaker:
Jihoon Park, Seoul National University, KR
Authors:
Jihoon Park, Jeongin Choe, Dohyun Kim and Jae-Joon Kim, Seoul National University, KR
Abstract
Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for the in-memory accelerators has also been introduced, they are currently limited to the case where all weights are assumed to be on-chip. This limitation becomes apparent as network sizes grow significantly beyond the in-memory footprint. Weight replacement schemes are essential to address this issue. We propose COMPASS, a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators. COMPASS is specially targeted at networks that exceed the capacity of PIM crossbar arrays, necessitating access to external memories. We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip. Our scheme takes into account the data dependence between layers, core utilization, and the number of write instructions to minimize latency and memory accesses, and improve energy efficiency. Simulation results demonstrate that COMPASS can accommodate many more networks using a minimal memory footprint, while improving throughput by 1.78X and providing 1.28X savings in energy-delay product (EDP) over baseline partitioning methods.
11:40 CEST TS08.9 OPS: OUTLIER-AWARE PRECISION-SLICE FRAMEWORK FOR LLM ACCELERATION
Speaker:
Fangxin Liu, Shanghai Jiao Tong University, CN
Authors:
Fangxin Liu1, Ning Yang1, Zongwu Wang1, Xuanpeng Zhu2, Haidong Yao2, Xiankui Xiong2, Qi Sun3 and Li Jiang1
1Shanghai Jiao Tong University, CN; 2ZTE Corporation, CN; 3Zhejiang University, CN
Abstract
Large language models (LLMs) have transformed numerous AI applications, with on-device deployment becoming increasingly important for reducing cloud computing costs and protecting user privacy. However, the astronomical model size and limited hardware resources pose significant deployment challenges. Model quantization is a promising approach to mitigate this gap, but the presence of outliers in LLMs reduces its effectiveness. Previous efforts addressed this issue by employing compression-based encoding for mixed-precision quantization. These approaches struggle to balance model accuracy with hardware efficiency due to their value-wise outlier granularity and complex encoding/decoding hardware logic. To address this, we propose OPS (Outlier-aware Precision-Slicing), an acceleration framework that exploits massive sparsity in the higher-order part of LLMs by splitting 16-bit values into a 4-bit/12-bit format. Crucially, OPS introduces an early bird mechanism that leverages the high-order 4-bit computation to predict the importance of the full calculation result. This mechanism enables efficient computational skips by continuing execution only for important computations and using preset values for less significant ones. This scheme can be efficiently integrated with existing hardware accelerators like systolic arrays without complex encoding/decoding. As a result, OPS outperforms state-of-the-art outlier-aware accelerators, achieving a 1.3-4.3x performance boost and 14.3-66.7% greater energy efficiency, with minimal model accuracy loss. This approach enables more efficient on-device LLM deployment, effectively balancing computational efficiency and model accuracy.
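As a rough illustration of 4-bit/12-bit precision slicing, the snippet below splits each signed 16-bit weight into a high 4-bit and a low 12-bit slice and uses a cheap high-slice dot product to decide whether the exact result is worth computing. The threshold and fallback value are assumptions; OPS's hardware integration and prediction policy are not modelled.

```python
# Toy sketch of precision slicing: each signed 16-bit weight is split into a
# high 4-bit slice and a low 12-bit slice, and a cheap high-slice dot product
# decides whether the full-precision result is worth computing. The threshold
# and fallback value are assumptions for illustration, not OPS's mechanism.
import numpy as np

def split_4_12(v):
    hi = v >> 12              # signed high 4-bit slice (arithmetic shift)
    lo = v & 0xFFF            # unsigned low 12-bit slice
    assert (hi << 12) + lo == v
    return hi, lo

def sliced_dot(weights, acts, thresh=1 << 15, fallback=0):
    hi = np.array([split_4_12(int(w))[0] for w in weights])
    partial = int(np.dot(hi, acts)) << 12      # high-order estimate of the result
    if abs(partial) < thresh:
        return fallback                        # "early bird": skip the full computation
    return int(np.dot(weights, acts))          # important enough: compute exactly

w = np.array([1000, -30000, 4096, 77], dtype=np.int64)
a = np.array([3, 1, -2, 5], dtype=np.int64)
print(sliced_dot(w, a))                        # exact dot product, computed in full
```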
11:41 CEST TS08.10 OPENC2: AN OPEN-SOURCE END-TO-END HARDWARE COMPILER DEVELOPMENT FRAMEWORK FOR DIGITAL COMPUTE-IN-MEMORY MACRO
Speaker:
Tianchu Dong, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Tianchu Dong, Shaoxuan Li, Yihang Zuo, Hongwu Jiang, Yuzhe Ma and Shanshi Huang, The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Digital Compute-in-Memory (DCIM), which inserts logic circuits into SRAM arrays, presents a significant advancement in CIM architecture. DCIM has shown great potential in applications, and the diversity of applications requires rapid hardware iteration. However, the hardware design flow from user specifications to layout is extremely tedious and time-consuming for manual design. Commercial EDA tools are limited by restrictive licenses and the inability to specifically optimize the datapath, which calls for an open-source end-to-end hardware compiler for DCIM. This paper proposes OpenC2, the first open-source end-to-end development framework for DCIM macro compilation. OpenC2 provides a template-based generation platform for DCIM macros across various technologies, sizes, and configurations. It can automatically generate a datapath-optimized, compact DCIM macro layout based on a hierarchical physical design methodology. Our experiment results show that OpenC2's compact design on FreePDK45 delivers over 30% area reduction and over 40% improvement in area efficiency compared to AutoDCIM on TSMC40.
11:42 CEST TS08.11 SPEEDING-UP SUCCESSIVE READ OPERATIONS OF STT-MRAM VIA READ PATH ALTERNATION FOR DELAY SYMMETRY
Speaker:
Taehwan Kim, Korea University, KR
Authors:
Taehwan Kim and Jongsun Park, Korea University, KR
Abstract
Recent research on data-intensive computing systems has demonstrated that system throughput and latency are critically dependent on memory read bandwidth, highlighting the need for fast memory read operations. Although spin-transfer torque magnetic random-access memory (STT-MRAM) has emerged as a promising alternative to CMOS-based embedded memories, STT-MRAM continues to face challenges related to read speed and energy efficiency. This paper introduces a novel read scheme that enhances read speed and energy efficiency in successive read operations by alternating read paths between data and reference cells. This approach effectively mitigates worst-case read scenarios by balancing the read voltage swings. HSPICE simulations using 28nm CMOS technology show a 31.5% improvement in read speed and a 48.8% reduction in energy consumption compared to the previous approach. SCALE-Sim system simulations also demonstrate that applying the proposed read scheme to STT-MRAM embedded memories in AI accelerators yields a significant reduction in memory energy for CNN inference tasks compared to SRAM embedded memory.

TS09 Low-power, energy-efficient and thermal-aware design

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS09.1 CO-UP: COMPREHENSIVE CORE AND UNCORE POWER MANAGEMENT FOR LATENCY-CRITICAL WORKLOADS
Speaker:
Ki-Dong Kang, Electronics and Telecommunications Research Institute, KR
Authors:
Ki-Dong Kang1, Gyeongseo Park1 and Daehoon Kim2
1Electronics and Telecommunications Research Institute, KR; 2Yonsei University, KR
Abstract
Improving energy efficiency to reduce costs in server environments has attracted considerable attention. Considering that processors account for a significant portion of energy consumption in servers, Dynamic Voltage and Frequency Scaling (DVFS) enhances their energy efficiency by adjusting the operational speed and power consumption of processors. Additionally, modern high-end processors extend DVFS functionality not only to core components but also to uncore parts, because the increasing complexity and integration of Systems on Chips (SoCs) have made the energy consumption of the uncore substantial. However, existing uncore voltage/frequency scaling fails to effectively consider Latency-Critical (LC) applications, leading to sub-optimal energy efficiency or degraded performance. In this paper, we introduce Co-UP, a power management scheme that simultaneously scales core and uncore frequencies for latency-critical applications, designed to improve energy efficiency without violating Service Level Objectives (SLOs). To this end, Co-UP incorporates a prediction model that estimates energy consumption and performance outcomes as uncore and core frequencies change. Based on the estimated gains, Co-UP adjusts uncore and/or core frequencies to further enhance energy efficiency or performance. This predictive model can rapidly adapt to new and unlearned loads, enabling Co-UP to operate online without any prior profiling. Our experiments show that Co-UP can reduce energy consumption by up to 28.2% compared to Intel's existing policy and up to 17.6% compared to state-of-the-art power management studies, without SLO violations.
11:05 CEST TS09.2 FLEXIBLE THERMAL CONDUCTANCE MODEL (TCM) FOR EFFICIENT THERMAL SIMULATION OF 3-D ICS AND PACKAGES
Speaker:
Shunxiang Lan, Shanghai Jiao Tong University, CN
Authors:
Shunxiang Lan, Min Tang and Jun Ma, Shanghai Jiao Tong University, CN
Abstract
Thermal management plays an increasingly important role in the design of 3-D integrated circuits (ICs) and packages. To deal with the related thermal issues, efficient and accurate evaluation of the thermal performance is essential. In this paper, an efficient approach based on a flexible thermal conductance model (TCM) is presented for thermal simulation of 3-D ICs and packages. Firstly, the entire structure is partitioned and classified into two kinds of regions, named region of interest (ROI) and region of fixity (ROF). The ROI usually contains the key components in thermal designs while the ROF holds invariant thermal characteristics. Then, in order to represent the thermal impact of the ROF on the ROI, a novel technique based on the TCM is developed, which can be treated as the equivalent boundary condition of the ROI. By this means, the solution domain of the whole system is constrained to the ROI, which results in a significant reduction of computational costs. Furthermore, in the representation of the ROF, a flexible TCM with elegant rational expressions on the heat convection coefficient is proposed to deal with varying boundary conditions, which greatly expands the applicability of this method. The validity and efficiency of the proposed method are illustrated by numerical examples, where a 138x speedup is achieved compared with commercial software.
11:10 CEST TS09.3 THANOS: ENERGY-EFFICIENT KEYWORD SPOTTING PROCESSOR WITH HYBRID TIME-FEATURE-FREQUENCY-DOMAIN ZERO-SKIPPING
Speaker:
Sangyeon Kim, Sogang University, KR
Authors:
Sangyeon Kim, Hyunmin Kim and Sungju Ryu, Sogang University, KR
Abstract
In recent years, the keyword spotting algorithm has gained significant attention for applications such as personalized virtual assistants. However, the keyword spotting system must always be turned on to listen to the input voice for recognition, which worsens the battery-constraint problem in edge devices. In this paper, we first analyze the sparsities in the keyword spotting computation. Based on these characteristics, we introduce the keyword spotting processor called Thanos, which enables a zero-skipping scheme across multiple keyword spotting domains to mitigate the burdensome energy consumption. Experimental results show that our hybrid-domain zero-skipping scheme reduces the latency and the energy consumption by 80.3-87.4% and 48.1-79.8%, respectively, over the baseline architecture.
11:15 CEST TS09.4 ALGORITHM-HARDWARE CO-DESIGN OF A UNIFIED ACCELERATOR FOR NON-LINEAR FUNCTIONS IN TRANSFORMERS
Speaker:
Haonan Du, Zhejiang University, CN
Authors:
Haonan Du1, Chenyi Wen1, Zhengrui Chen1, Li Zhang2, Qi Sun1, Zheyu Yan1 and Cheng Zhuo1
1Zhejiang University, CN; 2Hubei University of Technology, CN
Abstract
Non-linear functions (NFs) in Transformers require high-precision computation consuming significant time and energy, despite the aggressive quantization schemes for other components. Piece-wise Linear (PWL) approximation-based methods offer more efficient processing schemes for NFs but fall short in dealing with functions with high non-linearities. Moreover, PWL-based methods still suffer from inevitably high latency introduced by the Multiply-And-Add (MADD) unit. To address these issues, this paper proposes a novel quadratic approximation scheme and a highly integrated, multiplier-less hardware structure, as a unified method to accelerate any unary non-linear function. We also demonstrate implementation examples for GELU, Softmax, and LayerNorm. The experimental results show that the proposed method achieves up to 5.41% higher inference accuracy and 60.12% lower area-delay product.
11:20 CEST TS09.5 EFFICIENT HOLD BUFFER OPTIMIZATION BY SUPPLY NOISE-AWARE DYNAMIC TIMING ANALYSIS
Speaker:
Lishuo Deng, Southeast University, CN
Authors:
Lishuo Deng, Changwei Yan, Cai Li, Zhuo Chen and Weiwei Shan, Southeast University, CN
Abstract
As the CMOS process scales down, digital circuits become more susceptible to hold time violations due to increased sensitivity to supply voltage fluctuations. Since hold time violations are fatal, sufficient hold fixing buffers need to be inserted into the short paths to prevent them. However, by assuming a constant power supply level, traditional hold fixing causes imprecise and overly conservative timing analysis and hence leads to circuit overhead and degraded performance. To address this, we propose a power supply noise (PSN)-aware dynamic timing analysis for realistic hold time analysis and efficient hold buffer optimization, which integrates a machine learning-based timing model into the conventional design flow. Building on the highly effective application of the Weibull cumulative distribution function and machine learning for dynamic PSN-aware timing analysis, we propose introducing an additional parameter for PSN amplitude, which has a significant impact on delay, and narrowing the overall parameter range using real PSN waveforms extracted from RedHawk. This approach achieves a prediction error of only 3.45% for cell delay and 5.1% for path delay, while also reducing dataset acquisition costs. To the best of our knowledge, this work is the first to apply PSN-aware dynamic timing analysis specifically for hold optimization, mitigating the pessimism of traditional static timing analysis (STA) and effectively minimizing redundant hold fixing buffers while remaining compatible with existing design workflows. Since short paths often overlap with critical paths, reducing redundant hold buffers not only decreases area overhead but also enhances performance. Applied to a 22 nm, 64-point Fast Fourier Transform (FFT) circuit, our EDA-compatible method combined with a greedy algorithm reduces hold buffers by 55%, achieving not only a 6.79% circuit area reduction but also an 8.1% performance improvement due to the elimination of redundant buffers in short and critical paths.
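A hedged sketch of how a Weibull CDF can be used to model the extra delay caused by a supply-noise droop, with the droop amplitude as an explicit feature. The functional form, parameter names, and constants below are illustrative assumptions, not the paper's fitted model.

```python
import numpy as np

# Hedged sketch: extra cell delay under a supply droop modeled with a Weibull
# CDF over the droop/transition overlap time, scaled by the droop amplitude.

def weibull_cdf(t, scale, shape):
    t = np.maximum(t, 0.0)
    return 1.0 - np.exp(-((t / scale) ** shape))

def cell_delay(nominal_delay_ps, droop_amplitude_mv, droop_time_ps,
               scale=120.0, shape=1.8, sensitivity=0.004):
    """Delay = nominal * (1 + sensitivity * amplitude * Weibull(overlap time)).
    'sensitivity' and the Weibull parameters would be fitted per cell from
    SPICE or RedHawk waveforms in a real flow; values here are placeholders."""
    droop_factor = weibull_cdf(droop_time_ps, scale, shape)
    return nominal_delay_ps * (1.0 + sensitivity * droop_amplitude_mv * droop_factor)

# Example: 50 ps nominal delay, 60 mV droop overlapping the transition for 100 ps.
print(f"{cell_delay(50.0, 60.0, 100.0):.2f} ps")
```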
11:25 CEST TS09.6 LARED: EFFICIENT IR DROP PREDICTOR WITH LAYOUT-PRESERVING REBUILDER-ENCODER-DECODER ARCHITECTURE
Speaker:
Zhou Jin, SSSLab, Dept. of CST, China University of Petroleum-Beijing, China, CN
Authors:
ChengXuan Yu1, YanShuang Teng1, WenHao Dai1, YongJiang Li1, Wei Xing2, Xiao Wu3, Dan Niu4 and Zhou Jin5
1Super Scientific Software Laboratory, University of Petroleum-Beijing, CN; 2The University of Sheffield, GB; 3Huada Empyrean Software Co.Ltd, CN; 4Southeast University, CN; 5Super Scientific Software Laboratory, Dept. of CST, China University of Petroleum-Beijing, CN
Abstract
In the realm of integrated circuit verification, IR drop analysis plays a crucial role. Recent advancements in machine learning (ML) significantly enhance its efficiency, yet many current approaches fail to fully leverage the input structure of feature maps and the transmission mechanism of Power Delivery Network (PDN) layouts. To bridge these gaps, we introduce Layout-Preserving Rebuilder-Encoder-Decoder Architecture Predictor (LaRED), which employs a novel Rebuilder-Encoder-Decoder (RED) architecture and utilizes an innovative downsampling approach and upsampling framework to optimize its perception of instances and the transmission of features. LaRED captures information from various regions with asymmetric topological structure while preserving and transferring layout characteristics through deformable convolution, hybrid downsampling, cascaded upsampling, and attentional feature fusion. The rebuilder rebuilds raw input, whereas the encoder ensures comprehensive feature transmission across all instances. The decoder then facilitates seamless transfer of feature information across layers. This approach enables LaRED to integrate chip features of varying topologies and scales, enhancing its representational power. Compared to the current State-Of-The-Art (SOTA), MAUnet, LaRED achieves accuracy improvements of 34.6% to 42.6% in benchmark tests, establishing it as the new standard in static IR drop analysis for integrated circuit design with ML techniques. The code is available at https://github.com/Todi85/LaRED.
11:30 CEST TS09.7 COOL3D: COST-OPTIMIZED AND EFFICIENT LIQUID COOLING FOR 3D INTEGRATED CIRCUITS
Speaker:
Jing Li, Beihang University, CN
Authors:
Jing Li1, Bingrui Zhang1, Yuquan Sun1, Wei Xing2 and Yuanqing Cheng1
1Beihang University, CN; 2The University of Sheffield, GB
Abstract
CMOS scaling faces challenges due to lithography and device physics issues, leading to increased costs and difficulties in expanding chip footprint. 3D integration technology offers increased integration density without increasing footprint, but elevated power density makes heat dissipation a significant challenge. Microchannel cooling effectively removes heat inside 3D chips. Traditional microchannel optimizations typically focus only on minimizing pump power within a limited parameter design space, leading to suboptimal cooling efficiency. Moreover, existing research rarely considers manufacturing costs, limiting practical application. To address these issues, we propose a high-dimensional non-uniform microchannel design scheme based on Segmented Sampling Bayesian Optimization (SSBO). This multi-parameter collaborative optimization framework comprehensively optimizes microchannel design. Our method reduces pump power by 70% compared to limited parameter design spaces. Additionally, we introduce a cost model for microchannel design, formulating a multi-objective optimization problem that considers both manufacturing cost and pump power consumption. By searching for the Pareto front of this multi-objective problem, we demonstrate a balanced design between microchannel manufacturing cost and pump power and provide guidelines for key design parameters.
11:35 CEST TS09.8 JOINT DNN PARTITION AND THREAD ALLOCATION OPTIMIZATION FOR ENERGY-HARVESTING MEC SYSTEMS
Speaker:
Yizhou Shi, Nanjing University of Science and Technology, CN
Authors:
Yizhou Shi, Liying Li, Yue Zeng, Peijin Cong and Junlong Zhou, Nanjing University of Science and Technology, CN
Abstract
Deep neural networks (DNNs) have demonstrated exceptional performance, leading to diverse applications across various mobile devices (MDs). Considering factors like portability and environmental sustainability, an increasing number of MDs are adopting energy harvesting (EH) techniques for power supply. However, the computational intensity of DNNs presents significant challenges for their deployment on these resource-constrained devices. Existing approaches often employ DNN partition or offloading to mitigate the time and energy consumption associated with running DNNs on MDs. Nonetheless, existing methods frequently fall short in accurately modeling the execution time of DNNs, and do not consider thread allocation for further latency and energy optimization. To solve these problems, we propose a dynamic DNN partition and thread allocation method to optimize the latency and energy consumption of running DNNs on EH-enabled MDs. Specifically, we first investigate the relationship between DNN inference latency and allocated threads and establish an accurate DNN latency prediction model. Based on the prediction model, a DRL-based DNN partition (DDP) algorithm is designed to find the optimal partitions for DNNs. A thread allocation (TA) algorithm is proposed to reduce the inference latency. Experimental results from our test-bed platform demonstrate that, compared to four benchmarking methods, our scheme can reduce DNN inference latency and energy consumption by up to 37.3% and 38.5%, respectively.
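A hedged sketch of the underlying partitioning decision: given a per-layer latency model that depends on thread count, try every cut point between on-device and offloaded layers and keep the fastest. The Amdahl-style thread model and all numbers are illustrative assumptions, not the paper's learned model or DRL policy.

```python
# Illustrative partition-point search for device/server DNN splitting.

def layer_latency_ms(base_ms, threads, parallel_fraction=0.8):
    # Simple Amdahl-style scaling of on-device layer latency with thread count.
    return base_ms * ((1 - parallel_fraction) + parallel_fraction / threads)

def best_partition(layer_ms, layer_out_kb, server_speedup=8.0,
                   uplink_kb_per_ms=50.0, threads=4):
    """Try every cut point k: layers [0,k) run on the MD, the k-th cut's
    activation is uploaded, layers [k,n) run on the server."""
    n = len(layer_ms)
    best = None
    for k in range(n + 1):
        local = sum(layer_latency_ms(t, threads) for t in layer_ms[:k])
        upload = (layer_out_kb[k - 1] / uplink_kb_per_ms) if 0 < k < n else 0.0
        remote = sum(layer_ms[k:]) / server_speedup
        total = local + upload + remote
        if best is None or total < best[1]:
            best = (k, total)
    return best

layer_ms = [4.0, 6.0, 8.0, 8.0, 3.0]          # per-layer single-thread latency (ms)
layer_out_kb = [300, 150, 80, 40, 10]          # activation size at each cut (KB)
print(best_partition(layer_ms, layer_out_kb))  # (cut index, total latency in ms)
```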
11:40 CEST TS09.9 FAST DYNAMIC IR-DROP PREDICTION WITH DUAL-PATH SPATIAL-TEMPORAL ATTENTION
Speaker:
Bangqi Fu, The Chinese University of Hong Kong, HK
Authors:
Bangqi Fu, Lixin Liu, Qijing Wang, Yutao Wang, Martin Wong and Evangeline Young, The Chinese University of Hong Kong, HK
Abstract
The analysis of IR-drop stands as a fundamental step in optimizing the power distribution network (PDN), and subsequently influences the design performance. However, traditional IR-drop analysis using commercial tools proves to be exceedingly time-consuming. Fast and accurate IR-drop analysis is in urgent demand to achieve high performance on timing and power. Recently, machine learning approaches have garnered attention owing to their remarkable speed and extensibility in IC designs. However, prior works for dynamic IR-drop prediction presented limited performance since they did not exploit the time-varying activities. In this paper, we propose a dual-path model with spatial-temporal transformers to extract the static spatial features and dynamic time-variant activities for dynamic IR drop prediction. Experimental results on the large-scale advanced dataset CircuitNet show that our model significantly outperforms the state-of-the-art works.
11:45 CEST TS09.10 A NOVEL FREQUENCY-SPATIAL DOMAIN AWARE NETWORK FOR FAST THERMAL PREDICTION IN 2.5D ICS
Speaker:
Dan Niu, Southeast University, CN
Authors:
Dekang Zhang1, Dan Niu1, Zhou Jin2, Yichao Dong1, Jingweijia Tan3 and Changyin Sun4
1Southeast University, CN; 2Super Scientific Software Laboratory, Dept. of CST, China University of Petroleum-Beijing, CN; 3Jilin University, CN; 4Anhui University, CN
Abstract
In the post-Moore era, 2.5D chiplet-based ICs present significant challenges in thermal management due to increased power density and thermal hotspots. Neural network-based thermal prediction models can perform real-time predictions for many unseen new designs. However, existing CNN-based and GCN-based methods cannot effectively capture the global thermal features, especially for high-frequency components, hindering prediction accuracy enhancement. In this paper, we propose a novel frequency-spatial dual-domain aware prediction network (FSA-Heat) for fast and high-accuracy thermal prediction in 2.5D ICs. It integrates a high-to-low frequency and spatial domain encoder (FSTE) module with a frequency-domain cross-scale interaction module (FCIFormer) to achieve high-to-low frequency and global-to-local thermal dissipation feature extraction. Additionally, a frequency-spatial hybrid loss (FSL) is designed to effectively attenuate high-frequency thermal gradient noises and spatial misalignments. The experimental results show that the performance enhancements offered by our proposed method are substantial, outperforming the newly-proposed 2.5D method, GCN+PNA, by considerable margins (over 99% RMSE reduction, 4.23X inference speedup). Moreover, extensive experiments demonstrate that FSA-Heat also exhibits robust generalization capabilities.
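A minimal sketch of what a frequency-spatial hybrid loss can look like: a spatial MSE combined with an MSE on FFT magnitudes, so high-frequency hotspot structure is penalized explicitly. The weighting and map sizes are illustrative assumptions, not the paper's FSL definition.

```python
import numpy as np

# Illustrative hybrid loss for thermal-map prediction.
def frequency_spatial_loss(pred, target, alpha=1.0, beta=0.1):
    spatial = np.mean((pred - target) ** 2)
    pred_f = np.abs(np.fft.fft2(pred))
    target_f = np.abs(np.fft.fft2(target))
    frequency = np.mean((pred_f - target_f) ** 2)
    return alpha * spatial + beta * frequency

rng = np.random.default_rng(1)
target = rng.random((64, 64))                  # toy 64x64 thermal map
pred = target + 0.05 * rng.standard_normal((64, 64))
print(f"hybrid loss: {frequency_spatial_loss(pred, target):.4f}")
```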

TS10 Applications of Artificial Intelligence Systems

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS10.1 TAIL: EXPLOITING TEMPORAL ASYNCHRONOUS EXECUTION FOR EFFICIENT SPIKING NEURAL NETWORKS WITH INTER-LAYER PARALLELISM
Speaker:
Haomin Li, Shanghai Jiao Tong University, CN
Authors:
Haomin Li1, Fangxin Liu1, Zongwu Wang1, Dongxu Lyu1, Shiyuan Huang1, Ning Yang1, Qi Sun2, Zhuoran Song1 and Li Jiang1
1Shanghai Jiao Tong University, CN; 2Zhejiang University, CN
Abstract
Spiking neural networks (SNNs) are an alternative computational paradigm to artificial neural networks (ANNs) that have attracted attention due to their event-driven execution mechanisms, enabling extremely low energy consumption. However, the existing SNN execution model, based on software simulation or synchronized hardware circuitry, is incompatible with the event-driven nature, thus resulting in poor performance and energy efficiency. The challenge arises from the fact that neuron computations across multiple time steps result in increased latency and energy consumption. To overcome this bottleneck and leverage the full potential of SNNs, we propose TAIL, a pioneering temporal asynchronous execution mechanism for SNNs driven by a comprehensive analysis of SNN computations. Additionally, we propose an efficient dataflow design to support SNN inference, enabling concurrent computation of various time steps across multiple layers for optimal Processing Element (PE) utilization. Our evaluations show that TAIL greatly improves the performance of SNN inference, achieving a 6.94x speedup and a 6.97x increase in energy efficiency on current SNN computing platforms.
11:05 CEST TS10.2 EXPLOITING BOOSTING IN HYPERDIMENSIONAL COMPUTING FOR ENHANCED RELIABILITY IN HEALTHCARE
Speaker:
Sungheon Jeong, University of California, Irvine, US
Authors:
SungHeon Jeong1, Hamza Errahmouni Barkam1, Sanggeon Yun1, Yeseong Kim2, Shaahin Angizi3 and Mohsen Imani1
1University of California, Irvine, US; 2DGIST, KR; 3New Jersey Institute of Technology, US
Abstract
Hyperdimensional computing (HDC) enables efficient data encoding and processing in high-dimensional spaces, benefiting machine learning and data analysis. However, underutilization of these spaces can lead to overfitting and reduced model reliability, especially in data-limited systems—a critical issue in sectors like healthcare that demand robustness and consistent performance. We introduce BoostHD, an approach that applies boosting algorithms to partition the hyperdimensional space into subspaces, creating an ensemble of weak learners. By integrating boosting with HDC, BoostHD enhances performance and reliability beyond existing HDC methods. Our analysis highlights the importance of efficient utilization of hyperdimensional spaces for improved model performance. Experiments on healthcare datasets show that BoostHD outperforms state-of-the-art methods. On the WESAD dataset, it achieved an accuracy of 98.37 ± 0.32%, surpassing Random Forest, XGBoost, and OnlineHD. BoostHD also demonstrated superior inference efficiency and stability, maintaining high accuracy under data imbalance and noise. In person-specific evaluations, it achieved an average accuracy of 96.19%, outperforming other models. By addressing the limitations of both boosting and HDC, BoostHD expands the applicability of HDC in critical domains where reliability and precision are paramount.
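A hedged, simplified sketch of the general boosted-HDC idea, not the BoostHD implementation: split a D-dimensional hyperdimensional encoding into subspaces, train a simple centroid-based HD classifier per subspace, and combine them with accuracy-derived weights. The encoding, weighting rule, and toy data are illustrative assumptions.

```python
import numpy as np

# Illustrative ensemble of weak HD learners over hypervector subspaces.
rng = np.random.default_rng(0)
D, k, n_classes, n_feat = 4096, 8, 2, 16

proj = rng.standard_normal((n_feat, D))            # random-projection encoder
encode = lambda X: np.sign(X @ proj)               # bipolar hypervectors

X = rng.standard_normal((400, n_feat))
y = (X[:, 0] + 0.3 * rng.standard_normal(400) > 0).astype(int)
H = encode(X)

subspaces = np.array_split(np.arange(D), k)
models, weights = [], []
for idx in subspaces:                              # one weak learner per subspace
    centroids = np.stack([H[y == c][:, idx].sum(0) for c in range(n_classes)])
    pred = np.argmax(H[:, idx] @ centroids.T, axis=1)
    acc = max((pred == y).mean(), 1e-6)
    models.append((idx, centroids))
    weights.append(np.log(acc / max(1 - acc, 1e-6)))  # boosting-style weight

def predict(Hq):
    votes = np.zeros((len(Hq), n_classes))
    for (idx, centroids), w in zip(models, weights):
        votes[np.arange(len(Hq)), np.argmax(Hq[:, idx] @ centroids.T, 1)] += w
    return votes.argmax(1)

print("train accuracy:", (predict(H) == y).mean())
```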
11:10 CEST TS10.3 LCACHE: LOG-STRUCTURED SSD CACHING FOR TRAINING DEEP LEARNING MODELS
Speaker:
Shucheng Wang, China Mobile (Suzhou) Software Technology, CN
Authors:
Shucheng Wang1, Zhiguo Xu1, Zhandong Guo1, Jian Sheng2, Kaiye Zhou1 and Qiang Cao3
1China Mobile (Suzhou) Software Technology Co., Ltd., CN; 2Suzhou City University, CN; 3Huazhong University of Science and Technology, CN
Abstract
Training deep learning models is computationally and data-intensive. Existing approaches utilize local SSDs within training servers to cache datasets, thereby accelerating data loading during model training. However, we experimentally observe that data loading remains a performance bottleneck when randomly retrieving small-sized sample files on SSDs. In this paper, we introduce LCache, a log-structured dataset caching mechanism designed to fully leverage the I/O capabilities of SSDs and reduce I/O-induced training stalls. LCache determines the randomized dataset access order by extracting the pseudo-random seed from the training frameworks. It then aggregates small-sized sample files into larger chunks and stores them in a log file on SSDs, thus enabling sequential I/O requests on data retrieval and improving data loading throughput. Furthermore, LCache proposes a real-time log reordering mechanism that strategically schedules cached data to organize logs across different epochs, which enhances cache utilization and minimizes data retrieval from low-performance remote storage systems. Additionally, LCache incorporates a MetaIndex to enable rapid log traversal and querying. We evaluate LCache with various real-world DL models and datasets. LCache outperforms the native PyTorch Dataloader and NoPFS by up to 9.4x and 7.8x in throughput, respectively.
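A minimal sketch of the log-structuring idea, under the assumption that the framework's shuffle order can be reproduced from its pseudo-random seed: precompute the epoch's access order and lay samples out sequentially in a log so random small reads become sequential I/O. The file layout and helper names are illustrative, not LCache's format.

```python
import random, struct, io

# Illustrative log builder keyed to a reproducible shuffle order.
def epoch_order(num_samples, seed):
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)     # mirrors the framework's shuffle
    return order

def build_log(samples, order):
    """Append samples to an in-memory 'log' in the exact order they will be
    requested; record (offset, length) so reads become sequential."""
    log, index, offset = io.BytesIO(), {}, 0
    for i in order:
        data = samples[i]
        log.write(struct.pack("<I", len(data)) + data)
        index[i] = (offset, len(data))
        offset += 4 + len(data)
    return log.getvalue(), index

samples = [bytes([i]) * (100 + i) for i in range(8)]   # toy sample files
order = epoch_order(len(samples), seed=42)
log, index = build_log(samples, order)
print(order, {i: index[i] for i in order[:3]})
```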
11:15 CEST TS10.4 OLORAS: ONLINE LONG RANGE ACTION SEGMENTATION FOR EDGE DEVICES
Speaker:
Filippo Ziche, Università di Verona, IT
Authors:
Filippo Ziche and Nicola Bombieri, Università di Verona, IT
Abstract
Temporal action segmentation (TAS) is essential for identifying when actions are performed by a subject, with applications ranging from healthcare to Industry 5.0. In such contexts, the need for real-time, low-latency responses and privacy-aware data handling often requires the use of edge devices, despite their limited memory, power, and computational resources. This paper presents OLORAS, a novel TAS model designed for real-time performance on edge devices. By leveraging human pose data instead of video frames and employing linear recurrent units (LRUs), OLORAS efficiently processes long sequences while minimizing memory usage. Tested on the standard Assembly101 dataset, the model outperforms state-of-the-art TAS methods in accuracy with 10x memory footprint reduction, making it well-suited for deployment on resource-constrained devices.
11:20 CEST TS10.5 ONLINE LEARNING FOR DYNAMIC STRUCTURAL CHARACTERIZATION IN ELECTRON ENERGY LOSS SPECTROSCOPY
Speaker:
Lakshmi Varshika Mirtinti, Drexel University, US
Authors:
Lakshmi Varshika M1, Jonathan Hollenbach2, Nicolas Agostini3, Ankur Limaye3, Antonino Tumeo4 and Anup Das1
1Drexel University, US; 2Johns Hopkins University, US; 3Pacific Northwest National Lab, US; 4Pacific Northwest National Laboratory, US
Abstract
In-situ Electron Energy Loss Spectroscopy (EELS) is a crucial technique for determining the elemental composition of materials through EELS Spectrum Images (EELS-SI). While recent innovations have made it possible for EELS-SI data acquisition at rates of 400 frames per second with near-zero read noise, the challenge lies in processing this massive stream of real-time data to capture nanoscale dynamic changes. This task demands advanced machine learning methods capable of identifying subtle and complex features in EELS spectra. Furthermore, the EELS data acquired in difficult experimental conditions often suffer from a low signal-to-noise ratio (SNR), leading to unreliable classification and limiting their utility. In response to this critical need, we introduce a spiking neural network (SNN)-based Variational Autoencoder (VAE) that embeds spectral data into a latent space, facilitating precise prediction of structural changes. VAEs are designed to learn efficient low-dimensional representations while capturing the inherent variability in the data, making them highly effective for processing multi-dimensional data. Additionally, SNNs, which use biologically inspired spiking neurons, offer unmatched scalability and energy efficiency by processing information through binary spikes, making them ideal for high-throughput data. We validate our framework using MXene annealing data, achieving denoised spectrum images with an SNR of 28.3dB. For the first time, we present a fully online learning solution for dynamic structural tracking, implemented directly in hardware, eliminating the traditional bottleneck of offline training. Our method achieves reliable, real-time, on-device characterization of high-speed EELS data when evaluated on an FPGA platform. Joint experiments with the SNN-VAE model on both spiking autoencoder hardware and a software-trained hybrid configuration of hardware spiking encoders demonstrated latency reductions of 25.2× and 93.7×, and energy savings of 1.04× and 4.5×, respectively, compared to the baseline.
11:25 CEST TS10.6 SCALES: BOOST BINARY NEURAL NETWORK FOR IMAGE SUPER-RESOLUTION WITH EFFICIENT SCALINGS
Speaker:
Renjie Wei, Peking University, CN
Authors:
Renjie Wei1, Zechun Liu2, Yuchen Fan3, Runsheng Wang1, Ru Huang1 and Meng Li1
1Peking University, CN; 2Meta Inc, US; 3Meta, US
Abstract
Deep neural networks for image super-resolution (SR) have demonstrated superior performance. However, their large memory and computation consumption hinders deployment on resource-constrained devices. Binary neural networks (BNNs), which quantize the floating-point weights and activations to 1 bit, can significantly reduce the cost. Although BNNs for image classification have made great progress, existing BNNs for SR still suffer from a large performance gap relative to full-precision (FP) SR networks. To this end, we observe the activation distribution in SR networks and find much larger pixel-to-pixel, channel-to-channel, layer-to-layer, and image-to-image variation in the activation distribution than in image classification networks. However, existing BNNs for SR fail to capture these variations that contain rich information for image reconstruction, leading to inferior performance. To address this problem, we propose SCALES, a binarization method for SR networks that consists of the layer-wise scaling factor, the spatial re-scaling method, and the channel-wise re-scaling method, capturing the layer-wise, pixel-wise, and channel-wise variations efficiently in an input-dependent manner. We evaluate our method across different network architectures and datasets. For CNN-based SR networks, our binarization method SCALES outperforms the prior art method by 0.2dB with fewer parameters and operations. With SCALES, we achieve the first accurate binary Transformer-based SR network, improving PSNR by more than 1dB compared to the baseline method.
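A hedged sketch of input-dependent re-scaling for binarized activations, illustrative of the idea rather than SCALES' exact formulation: binarize to {-1, +1}, then restore per-layer, per-channel, and per-pixel magnitude with cheap scaling factors computed from the full-precision activation.

```python
import numpy as np

# Illustrative binarization with layer-, channel-, and pixel-wise scales.
def binarize_with_scales(x):
    """x: activations with shape (C, H, W)."""
    sign = np.where(x >= 0, 1.0, -1.0)
    layer_scale = np.mean(np.abs(x))                                  # one scalar per layer
    channel_scale = np.mean(np.abs(x), axis=(1, 2), keepdims=True)    # per channel
    spatial_scale = np.mean(np.abs(x), axis=0, keepdims=True)         # per pixel
    # Combine the factors multiplicatively, normalized by the layer scale.
    scale = channel_scale * spatial_scale / max(layer_scale, 1e-8)
    return sign * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
xb = binarize_with_scales(x)
print("reconstruction MSE:", np.mean((x - xb) ** 2))
```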
11:30 CEST TS10.7 POROS: ONE-LEVEL ARCHITECTURE-MAPPING CO-EXPLORATION FOR TENSOR ALGORITHMS
Speaker:
Fuyu Wang, Sun Yat-sen University, CN
Authors:
Fuyu Wang and Minghua Shen, Sun Yat-sen University, CN
Abstract
Tensor algorithms increasingly rely on specialized accelerators to meet growing performance and efficiency demands. Given the rapid evolution of these algorithms and the high cost of designing accelerators, automated solutions for jointly optimizing both architectures and mappings have gained attention. However, the joint design space is non-convex and non-smooth, hindering the finding of optimal or near-optimal designs. Moreover, prior work conducts two-level exploration, resulting in a combinatorial explosion. In this paper, we propose Poros, a one-level architecture-mapping co-exploration framework. Poros directly explores a batch of architecture-mapping configurations and evaluates their performance. It then exploits reinforcement learning to perform gradient-based search in the non-smooth joint design space. By sampling from the policy, Poros keeps exploring new actions to address non-convexity. Experimental results demonstrate that Poros achieves up to 5.32x and 2.15x better EDP compared with hand-designed accelerators and state-of-the-art automatic approaches, respectively. Through its one-level exploration scheme, Poros also converges at least 20% faster than other approaches.
11:35 CEST TS10.8 A CNN COMPRESSION METHODOLOGY FOR LAYER-WISE RANK SELECTION CONSIDERING INTER-LAYER INTERACTIONS
Speaker:
Milad Kokhazadeh, School of Informatics, Aristotle University of Thessaloniki, GR
Authors:
Milad Kokhazadeh1, Georgios Keramidas2, Vasilios Kelefouras3 and Iakovos Stamoulis4
1PhD Candidate, Aristotle University of Thessaloniki, GR; 2Aristotle University of Thessaloniki / Think Silicon S.A., GR; 3University of Plymouth, GB; 4Think Silicon, S.A. An Applied Materials Company, GR
Abstract
Convolutional Neural Networks (CNNs) achieve state-of-the-art performance across various application domains but are often resource-intensive, limiting their use on resource-constrained devices. Low-rank factorization (LRF) has emerged as a promising technique to reduce the computational complexity and memory footprint of CNNs, enabling efficient deployment without significant performance loss. However, challenges remain in optimizing the rank selection problem, balancing memory reduction and accuracy, and integrating LRF into the training process of CNNs. In this paper, a novel and generic methodology for layer-wise rank selection is presented, considering inter-layer interactions. Our approach is compatible with any decomposition method and does not require additional retraining. The proposed methodology is evaluated on thirteen widely used CNN models, significantly reducing model parameters and Floating-Point Operations (FLOPs). In particular, our approach achieves up to a 94.6% parameter reduction (82.3% on average) and up to a 90.7% FLOPs reduction (59.6% on average), with less than a 1.5% drop in validation accuracy, demonstrating superior performance and scalability compared to existing techniques.
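A minimal sketch of low-rank factorization with a simple per-layer rank choice: decompose a weight matrix with SVD and keep the smallest rank whose singular-value energy exceeds a threshold. The energy criterion is a common heuristic used here for illustration, not the paper's inter-layer-aware selection method.

```python
import numpy as np

# Illustrative SVD-based layer factorization with an energy-based rank cutoff.
def factorize(W, energy=0.95):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    r = int(np.searchsorted(np.cumsum(S**2) / np.sum(S**2), energy)) + 1
    A = U[:, :r] * S[:r]          # (out, r)
    B = Vt[:r, :]                 # (r, in)
    return A, B, r

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 100)) @ rng.standard_normal((100, 512))  # low-rank-ish layer
A, B, r = factorize(W)
orig, fact = W.size, A.size + B.size
print(f"rank {r}, params {orig} -> {fact} ({100 * (1 - fact / orig):.1f}% reduction), "
      f"relative error {np.linalg.norm(W - A @ B) / np.linalg.norm(W):.3f}")
```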
11:40 CEST TS10.9 FINEQ: SOFTWARE-HARDWARE CO-DESIGN FOR LOW-BIT FINE-GRAINED MIXED-PRECISION QUANTIZATION OF LLMS
Speaker:
Xilong Xie, Beihang University, CN
Authors:
Xilong Xie1, Liang Wang1, Limin Xiao1, Meng Han1, Lin Sun2, Shuai Zheng1 and Xiangrong Xu1
1Beihang University, CN; 2Jiangsu Shuguang Optoelectric Co., Ltd., CN
Abstract
Large language models (LLMs) have significantly advanced the natural language processing paradigm but impose substantial demands on memory and computational resources. Quantization is one of the most effective ways to reduce the memory consumption of LLMs. However, advanced single-precision quantization methods experience significant accuracy degradation when quantizing to ultra-low bits. Existing mixed-precision quantization methods quantize by groups at coarse granularity. Employing high precision for group data leads to substantial memory overhead, whereas low precision severely impacts model accuracy. To address this issue, we propose FineQ, a software-hardware co-design for low-bit fine-grained mixed-precision quantization of LLMs. First, FineQ partitions the weights into finer-grained clusters and considers the distribution of outliers within these clusters, thus achieving a balance between model accuracy and memory overhead. Then, we propose an outlier protection mechanism within clusters that uses 3 bits to represent outliers and introduce an encoding scheme for index and data concatenation to enable aligned memory access. Finally, we introduce an accelerator utilizing temporal coding that effectively supports the quantization algorithm while simplifying the multipliers in the systolic array. FineQ achieves higher model accuracy compared to the SOTA mixed-precision quantization algorithm at a close average bit-width. Meanwhile, the accelerator achieves up to 1.79x higher energy efficiency and reduces the area of the systolic array by 61.2%.
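A hedged sketch of fine-grained mixed-precision quantization with in-cluster outlier protection, illustrative of the general idea rather than FineQ's exact encoding: split weights into small clusters, quantize inliers to 2 bits with a per-cluster scale, and keep the largest-magnitude outlier of each cluster at higher precision. Cluster size and bit widths are assumptions.

```python
import numpy as np

# Illustrative per-cluster quantization with one protected outlier per cluster.
def quantize_cluster(w, inlier_bits=2, outliers_per_cluster=1):
    out_idx = np.argsort(np.abs(w))[-outliers_per_cluster:]   # protect outliers
    q = np.zeros_like(w)
    mask = np.ones(len(w), dtype=bool)
    mask[out_idx] = False
    scale = np.max(np.abs(w[mask])) / (2 ** (inlier_bits - 1) - 1 + 1e-12)
    q[mask] = np.round(w[mask] / scale) * scale                # low-bit inliers
    q[out_idx] = w[out_idx]                                    # kept at higher precision here
    return q

rng = np.random.default_rng(0)
W = rng.standard_normal(1024) * np.where(rng.random(1024) < 0.01, 8.0, 1.0)
clusters = W.reshape(-1, 8)                                    # 8-weight clusters
Wq = np.concatenate([quantize_cluster(c) for c in clusters])
print("quantization MSE:", np.mean((W - Wq) ** 2))
```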
11:45 CEST TS10.10 SOLVING THE COLD-START PROBLEM FOR THE EDGE: CLUSTERING AND ADAPTIVE DEEP LEARNING FOR EMOTION DETECTION
Speaker:
Junjiao Sun, Centro de Electrónica Industrial, Universidad Politécnica de Madrid, ES
Authors:
Junjiao Sun1, Laura Gutierrez Martin2, Jose Miranda Calero3, Celia López-Ongil2, Jorge Portilla1 and Jose Andres Otero Marnotes1
1Centro de Electrónica Industrial Universidad Politecnica de Madrid, ES; 2UC3M (Universidad Carlos III de Madrid), ES; 3EPFL, CH
Abstract
Designing AI-based applications personalized to each user's behavior presents significant challenges due to the cold start problem and the impracticality of extensive individual data labeling. These challenges are further compounded when deploying such applications at the edge, where limited computing resources constrain the design space. This paper introduces a novel approach to AI-driven personalized solutions in biosensing applications by combining deep learning with clustering-based separation techniques. The proposed Clustering and Learning for Emotion Adaptive Recognition (CLEAR) methodology strikes a balance between population-wide models and fully personalized systems by leveraging data-driven clustering. CLEAR demonstrates its effectiveness in emotion recognition tasks, and its integration with fine-tuning enables efficient deployment on edge devices, ensuring data privacy and real-time detection when new users are introduced to the system. We conducted experiments for model personalization on two edge computing platforms: the Coral Edge TPU Dev Board and the Raspberry Pi with an Intel Movidius Neural Compute Stick 2. The results show that initial cluster assignment for new users can be achieved without labeled data, directly addressing the cold-start problem. Compared to baseline validation without clustering, this proposal improves accuracy from 75% to 81.9%. Furthermore, fine-tuning with minimal labeled data significantly improves accuracy, achieving up to 86.34% for the fear detection task in the WEMAC dataset while remaining suitable for deployment on resource-constrained edge devices.
11:50 CEST TS10.11 KALMMIND: A CONFIGURABLE KALMAN FILTER DESIGN FRAMEWORK FOR EMBEDDED BRAIN-COMPUTER INTERFACES
Speaker:
Guy Eichler, Columbia University, Department of Computer Science, IL
Authors:
Guy Eichler, Joseph Zuckerman and Luca Carloni, Columbia University, US
Abstract
Kalman Filter (KF) is one of the most prominent algorithms to predict motion from measurements of brain activity. However, little effort has been made to optimize the KF for deployment in embedded brain-computer interfaces (BCIs). To address this challenge, we propose a new framework for designing KF hardware accelerators specialized for BCI, which facilitates design-space exploration by providing a tunable balance between latency and accuracy. Through FPGA-based experiments with brain data, we demonstrate improvements in both latency and accuracy compared to the state of the art.
11:51 CEST TS10.12 SEGTRANSFORMER: ENHANCING SOFTMAX PERFORMANCE THROUGH SEGMENTATION WITH A RERAM-BASED PIM ACCELERATOR
Speaker:
Ing-Chao Lin, National Cheng Kung University, TW
Authors:
YuCheng Wang1, Ing-Chao Lin1 and Yuan-Hao Chang2
1National Cheng Kung University, TW; 2Academia Sinica, TW | National Taiwan University, TW
Abstract
To accelerate Transformer computations, numerous ReRAM-based Processor-In-Memory (PIM) architectures have been proposed, which effectively speed up matrix multiplication. However, these approaches often shift the performance bottleneck from the attention mechanism to the Softmax computation. Additionally, data sharding for acceleration can disrupt the core logic of the Transformer, and when computing the exponential part of extremely small Euler's numbers, slight output differences lead to inefficiency in Softmax computation. To address these challenges, we propose SegTransformer, a ReRAM-based PIM accelerator that enhances matrix computation speed through segmentation techniques and generates segmented data for local Softmax operations. Moreover, we introduce an Integrated Softmax Processing Unit (ISPU), which computes both local Softmax and global factors to reduce errors and improve efficiency. Experimental results show that SegTransformer outperforms state-of-the-art Transformer accelerators.

LK02 ASD Lunchtime Keynote

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 13:15 CEST - 14:00 CEST

Time Label Presentation Title
Authors
13:15 CEST LK02.1 AI/ML AT THE FOREFRONT OF SEMICONDUCTOR EVOLUTION: ENHANCING DESIGN, EFFICIENCY, AND PERFORMANCE
Presenter:
Yankin Tanurhan, Synopsys, US
Author:
Yankin Tanurhan, Synopsys, US
Abstract
As artificial intelligence (AI) and machine learning (ML) drive innovation, their impact on the semiconductor market is transformative. This keynote will explore the latest AI/ML trends and their implications for SoC designs targeting high-performance compute, edge AI, and IoT applications. The presentation will cover AI/ML's role in developing next-generation semiconductor designs, including how AI/ML algorithms are incorporated into EDA tools to optimize chip design and enable efficient verification and manufacturing. Emerging AI/ML trends driving requirements for advanced neural processing units (NPU) will be explored, including generative AI applications like large language models and text-to-image generators. Finally, the role of transformer-based neural networks in implementing energy-efficient SoCs will be discussed.

ET03 Lifecycle Management of Emerging Memories: Why and How?

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 15:30 CEST

Abstract:

Emerging memory technologies, such as Resistive RAM (ReRAM), Phase-Change Memory (PCM), Spin-Transfer Torque Magnetic Memory (STT-MRAM), and Ferroelectric FET (FeFET), receive a lot of interest from both academia and industry thanks to their attractive properties. These technologies can implement dense, fast, and non-volatile memories that can be used to efficiently store data as well as implement AI circuits. However, mass production is still limited, because these technologies suffer from quality and reliability issues that need to be addressed after manufacturing and during lifetime. These technologies are susceptible to new manufacturing defects due to new materials and structures as well as endurance problems. This tutorial presents a holistic view on the root causes of quality and reliability issues, their impact on the circuit’s behavior, and possible solutions to properly address these issues, guaranteeing the required quality and reliability level. Finally, this tutorial allows attendees to understand the lifecycle management choices available to ensure high-quality and reliable emerging memories.

Speakers:

Leticia Maria Bolzani Poehls, IHP – Leibniz Institute for High Performance Microelectronics - Germany

Moritz Fieback, Delft University of Technology, The Netherlands

Target audience:

This tutorial is addressed to academia (from PhD students to postdocs) and professionals from industry who would like to know more about how to guarantee the quality of emerging memories and consequently their adoption in real applications. Around 40 participants are expected.

Learning objectives:

  • Describe why emerging memories need lifecycle management and how this holistic approach fits in the memories’ design process
  • Present and compare the lifecycle management of two different types of emerging memories including their quality and reliability issues and possible solutions
  • Summarize the key challenges that are involved in future lifecycle management for emerging memories

Required background:

  • Basic understanding of emerging memories and some general understanding of the definitions related to test theory and reliability.

Detailed program:

The proposed tutorial is based on the following plan:

  • Introduction: Why we need emerging memories?
  • Background: Why we need to adopt a lifecycle management approach for emerging memories?
  • Case study 1: Memory type, RRAMs
  • Case study 2: Memory type, STT-MRAMs
  • Comparison highlighting overlapping and differentiating features of two technologies
  • Conclusion & Future

FS10 Focus Session - GenAI-Native EDA: Redefining Verification with Large Language Models

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 15:30 CEST

Session chair:
Pierre-Emmanuel Gaillardon, University of Utah, US

Organiser:
Pierre-Emmanuel Gaillardon, University of Utah, US

As hardware design processes grow in complexity and scale, verification methodologies reliant on human expertise and manual effort are increasingly insufficient to handle intricate interdependencies and challenging constraints across design stages. Generative AI (GenAI), particularly Large Language Models (LLMs), offers a breakthrough approach, enabling sophisticated pattern recognition, multimodal data integration, and adaptive learning to tackle these verification challenges. From early-stage Power, Performance, and Area (PPA) estimations to advanced anomaly detection and layout optimization, AI-driven tools are set to transform verification workflows. By synthesizing circuit representations across specifications, netlists, and physical layouts, these AI models promise not only enhanced verification precision but also significant reductions in time-to-market, making verification processes more scalable for next-generation design technologies. In this special session, attendees will explore cutting-edge GenAI applications for EDA with a focus on hardware verification, gaining insights into how advanced techniques like multimodal learning, LLM-based optimization, and multi-agent systems can boost verification accuracy. Discussions will highlight foundational shifts toward AI-native EDA tools and examine the potential of LLMs to automate and scale verification to meet the demands of increasingly complex hardware systems. Presentations will also cover AI-driven approaches to optimizing verification workflows and automating the detection of potential design flaws.

Time Label Presentation Title
Authors
14:00 CEST FS10.1 EDA-AWARE RTL GENERATION WITH LARGE LANGUAGE MODELS
Speaker:
Valerio Tenace, PrimisAI, US
Authors:
Mubashir Islam1, Humza Sami1, Pierre-Emmanuel Gaillardon2 and Valerio Tenace1
1PrimisAI, US; 2University of Utah -- PrimisAI, US
Abstract
Large Language Models (LLMs) have become increasingly popular for generating RTL code. However, producing error-free RTL code in a zero-shot setting remains highly challenging even for state-of-the-art LLMs, often leading to issues that require manual, iterative refinement. This additional debugging process can dramatically increase the verification workload, underscoring the need for robust, automated correction mechanisms to ensure code correctness from the start. We will present AIVRIL2, a self-verifying, LLM-agnostic agentic framework aimed at enhancing RTL code generation through iterative corrections of both syntax and functional errors. Our approach leverages a collaborative multi-agent system that incorporates feedback from error logs generated by EDA tools to automatically identify and resolve design flaws. Experimental results, conducted on the VerilogEval-Human benchmark suite, demonstrate that our framework significantly improves code quality, achieving nearly a 3.4× enhancement over prior methods. In the best-case scenario, functional pass rates of 77% for Verilog and 66% for VHDL were obtained, thus substantially improving the reliability of LLM-driven RTL code generation.

HSD02 HackTheSilicon DATE

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 18:00 CEST


LKS03 Later … with the keynote speakers

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 15:00 CEST


TS11 Architectural and microarchitectural design - 1

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS11.1 ACCELERATING AUTHENTICATED BLOCK CIPHERS VIA RISC-V CUSTOM CRYPTOGRAPHY INSTRUCTIONS
Speaker:
Yuhang Qiu, State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing, China, CN
Authors:
Qiu Yuhang, Wenming Li, Liu Tianyu, Wang Zhen, Zhang Zhiyuan, Fan Zhihua, Ye Xiaochun, Fan Dongrui and Tang Zhimin, State Key Lab of Processors, Institute of Computing Technology, CAS, CN
Abstract
As one of the standardized encryption algorithms, authenticated block ciphers based on Galois/Counter Mode (GCM) are a widely used method to guarantee accuracy and reliability in data transmission. Across the execution of authenticated block ciphers, the authentication operation is the main performance bottleneck because it introduces operations in a high-dimensional Galois field (GF) that cannot be executed efficiently with existing ISAs. To overcome this problem, we propose a custom ISA extension and combine it with the RISC-V cryptography extension to accelerate the whole process of authenticated block ciphers. Besides, we propose a specific hardware design including a fully-pipelined GF(2^128) multiplier to support the extended instructions and integrate it into the multi-issue out-of-order core XT910 without introducing any clock frequency overhead. The proposed design manages to accelerate the main operations in various kinds of authenticated block ciphers. We compare the performance of our design to other existing acceleration schemes based on RISC-V ISA extensions. Experimental results show that our design outperforms other related work and achieves up to 17x speedup with a lightweight hardware overhead.
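For reference, the operation that such a multiplier accelerates is GCM's GHASH multiplication in GF(2^128). Below is the textbook bit-serial shift-and-reduce formulation (the hardware pipelines this into a single-cycle-throughput unit); operands are arbitrary example values.

```python
# Reference bit-serial multiplication in GF(2^128) as used by GCM's GHASH.
R = 0xE1 << 120   # reduction constant for x^128 + x^7 + x^2 + x + 1

def gf128_mul(x: int, y: int) -> int:
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:      # process x MSB-first
            z ^= v
        if v & 1:                      # shift v with modular reduction
            v = (v >> 1) ^ R
        else:
            v >>= 1
    return z

# Two arbitrary 128-bit operands.
a = 0x0388DACE60B6A392F328C2B971B2FE78
b = 0x66E94BD4EF8A2C3B884CFA59CA342B2E
print(hex(gf128_mul(a, b)))
```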
14:05 CEST TS11.2 NDPAGE: EFFICIENT ADDRESS TRANSLATION FOR NEAR-DATA PROCESSING ARCHITECTURES VIA TAILORED PAGE TABLE
Speaker:
Qingcai Jiang, University of Science and Technology of China, CN
Authors:
Qingcai Jiang, Buxin Tu and Hong An, University of Science and Technology of China, CN
Abstract
Near-Data Processing (NDP) has been a promising architectural paradigm to address the memory wall problem for data-intensive applications. Practical implementation of NDP architectures calls for system support for better programmability, where having virtual memory (VM) is critical. Modern computing systems incorporate a 4-level page table design to support address translation in VM. However, simply adopting an existing 4-level page table design in NDP systems causes significant address translation overhead because (1) NDP applications generate a lot of address translation requests, and (2) the limited L1 cache in NDP systems cannot cover the accesses to page table entries (PTEs). We extensively analyze the 4-level page table design and observe that (1) the memory access to page table entries is highly irregular, thus cannot benefit from the L1 cache, and (2) the last two levels of page tables are nearly fully occupied. Based on our observations, we propose NDPage, an efficient page table design tailored for NDP systems. The key mechanisms of NDPage are (1) an L1 cache bypass mechanism for PTEs that not only accelerates the memory accesses of PTEs but also prevents the pollution of PTEs in the cache system, and (2) a flattened page table design that merges the last two levels of page tables, allowing the page table to enjoy the flexibility of a 4KB page while reducing the number of PTE accesses. We evaluate NDPage using a variety of data-intensive workloads. Our evaluation shows that in a single-core NDP system, NDPage improves the end-to-end performance over the state-of-the-art address translation mechanism by 14.3%; in 4-core and 8-core NDP systems, NDPage enhances the performance by 9.8% and 30.5%, respectively.
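A hedged sketch contrasting a conventional 4-level radix page walk with a flattened walk that merges the last two levels into one larger table, which is the conceptual effect of NDPage's design. Field widths follow x86-64-style 9-bit indices and 4 KB pages; the merged level uses an 18-bit index. All details are illustrative assumptions.

```python
# Illustrative walk-length comparison for 4-level vs. flattened page tables.
def split_vaddr(vaddr, index_bits):
    """Split the virtual page number into per-level indices, top level first."""
    vpn = vaddr >> 12
    idxs, shift = [], sum(index_bits)
    for bits in index_bits:
        shift -= bits
        idxs.append((vpn >> shift) & ((1 << bits) - 1))
    return idxs

def walk_accesses(vaddr, index_bits):
    # One memory access per page-table level touched during the walk.
    return len(split_vaddr(vaddr, index_bits))

vaddr = 0x00007F12_3456_7000
print("4-level walk accesses:   ", walk_accesses(vaddr, [9, 9, 9, 9]))      # 4
print("flattened walk accesses: ", walk_accesses(vaddr, [9, 9, 18]))        # 3
```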
14:10 CEST TS11.3 SPIRE: INFERRING HARDWARE BOTTLENECKS FROM PERFORMANCE COUNTER DATA
Speaker:
Nicholas Wendt, University of Michigan, US
Authors:
Nicholas Wendt1, Mahesh Ketkar2 and Valeria Bertacco1
1University of Michigan, US; 2Intel Labs, US
Abstract
The persistent demand for greater computing efficiency, coupled with diminishing returns from semiconductor scaling, has led to increased microarchitecture complexity and diversity. Thus, it has become increasingly difficult for application developers and hardware architects to accurately identify low-level performance bottlenecks. Abstract performance models, such as roofline models, help but strip away important microarchitectural details. In contrast, analyses based on hardware performance counters preserve detail but are challenging to implement. This work proposes SPIRE, a novel performance model that combines the accessibility and generality of roofline models with the microarchitectural detail of performance counters. SPIRE (Statistical Piecewise Linear Roofline Ensemble) uses a collection of roofline models to estimate a processor's maximum throughput, based on data from its performance counters. Training this ensemble simply requires sampling data from a processor's performance counters. After training a SPIRE model on 23 workloads running on a CPU, we evaluated it with 4 new workloads and compared our findings against a commercial performance analysis tool. We found that our SPIRE analysis accurately identified many of the same bottlenecks while requiring minimal deployment effort.
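For context, a minimal sketch of the classic single-roofline bound that SPIRE generalizes into a counter-driven ensemble: attainable throughput is the minimum of the compute ceiling and bandwidth times operational intensity, with intensity derived from counter-style event counts. The peak numbers are illustrative assumptions.

```python
# Illustrative single-roofline bound from performance-counter-style inputs.
def roofline_gflops(flops, bytes_moved, peak_gflops=500.0, peak_gbps=100.0):
    intensity = flops / max(bytes_moved, 1)          # FLOPs per byte
    return min(peak_gflops, peak_gbps * intensity), intensity

# Counter-style inputs: retired FP ops and DRAM bytes over a sampling window.
bound, oi = roofline_gflops(flops=2.0e11, bytes_moved=8.0e11)
print(f"operational intensity {oi:.2f} FLOP/B, roofline bound {bound:.1f} GFLOP/s")
```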
14:15 CEST TS11.4 IMPROVING ADDRESS TRANSLATION IN TAGLESS DRAM CACHE BY CACHING PTE PAGES
Speaker:
Osang Kwon, Sungkyunkwan University, KR
Authors:
Osang Kwon, Yongho Lee and Seokin Hong, Sungkyunkwan University, KR
Abstract
This paper proposes a novel caching mechanism for PTE pages to enhance the Tagless DRAM Cache architecture and improve address translation in large in-package DRAM caches. Existing OS-managed DRAM cache architectures have achieved significant performance improvements by focusing on efficient tag management. However, prior studies have been limited in that they only update the PTE after caching pages, without directly accessing PTEs from the DRAM cache. This limitation leads to performance degradation during page walks. To address this issue, we propose a method to copy both data pages and PTE pages simultaneously to the DRAM cache. This approach reduces address translation and cache access latency. Additionally, we introduce a shootdown mechanism to maintain the consistency of PTEs and page walk caches in multi-core systems, ensuring that all cores access the latest information for shared pages. Experimental results demonstrate that the proposed PTE-page caching can reduce address translation overhead by up to 33.3% compared to traditional OS-managed tagless DRAM caches, improving overall program execution time by an average of 10.5%. This effectively mitigates bottlenecks caused by address translation.
14:20 CEST TS11.5 EXPLORING THE SPARSITY-QUANTIZATION INTERPLAY ON A NOVEL HYBRID SNN EVENT-DRIVEN ARCHITECTURE
Speaker:
Tosiron Adegbija, University of Arizona, US
Authors:
Ilkin Aliyev, Jesus Lopez and Tosiron Adegbija, University of Arizona, US
Abstract
Spiking Neural Networks (SNNs) offer potential advantages in energy efficiency but currently trail Artificial Neural Networks (ANNs) in versatility, largely due to challenges in efficient input encoding. Recent work shows that direct coding achieves superior accuracy with fewer timesteps than traditional rate coding. However, there is a lack of specialized hardware to fully exploit the potential of direct-coded SNNs, especially their mix of dense and sparse layers. This work proposes the first hybrid inference architecture for direct-coded SNNs. The proposed hardware architecture comprises a dense core to efficiently process the input layer and sparse cores optimized for event-driven spiking convolutions. Furthermore, for the first time, we investigate and quantify the quantization effect on sparsity. Our experiments on two variations of the VGG9 network and implemented on a Xilinx Virtex UltraScale+ FPGA (Field-Programmable Gate Array) reveal two novel findings. Firstly, quantization increases the network sparsity by up to 15.2% with minimal loss of accuracy. Combined with the inherent low power benefits, this leads to a 3.4x improvement in energy compared to the full-precision version. Secondly, direct coding outperforms rate coding, achieving a 10% improvement in accuracy and consuming 26.4x less energy per image. Overall, our accelerator achieves ~51x higher throughput and consumes half the power compared to previous work. Our accelerator code is available at: https://github.com/githubofaliyev/SNN-DSE/tree/DATE25.
14:25 CEST TS11.6 SWIFT-SIM: A MODULAR AND HYBRID GPU ARCHITECTURE SIMULATION FRAMEWORK
Speaker:
Xiangrong Xu, Beihang University, CN
Authors:
Xiangrong Xu, Yuanqiu Lv, Liang Wang, Limin Xiao, Meng Han, Runnan Shen and Jinquan Wang, Beihang University, CN
Abstract
Simulation tools are critical for architects to quickly estimate the impact of aggressive new features of GPU architecture. Existing cycle-accurate GPU simulators are typically cumbersome and slow to run. We observe that it is time-consuming and unnecessary for cycle-accurate GPU simulators to perform detailed simulations for the entire GPU when exploring the design space of specific components. This paper proposes Swift-Sim, a modular and hybrid GPU simulation framework. With a highly modular design, our framework can choose appropriate modeling approaches for each component according to requirements. For components of interest to architects, we use cycle-accurate simulation to evaluate new GPU architectures. For other components, we use analytical modeling, which accelerates simulation speed with only minor and acceptable degradation in overall accuracy. Based on this simulation framework, we present two working examples of hybrid modeling that simulate the ALU pipeline and memory accesses using analytical models. We further implement two GPU performance simulators with different levels of simplification based on Swift-Sim and evaluate them using configurations from real GPUs. The results show that the two simulators achieve an 82.6x and 211.2x geometric mean speedup compared to Accel-Sim with insignificant accuracy degradation.
14:30 CEST TS11.7 HYMM: A HYBRID SPARSE-DENSE MATRIX MULTIPLICATION ACCELERATOR FOR GCNS
Speaker:
Hunjong Lee, Korea University, KR
Authors:
Hunjong Lee1, Jihun Lee1, Jaewon Seo1, Yunho Oh1, Myungkuk Yoon2 and Gunjae Koo1
1Korea University, KR; 2Ewha Womans University, KR
Abstract
Graph convolutional networks (GCNs) are emerging neural network models designed to process graph-structured data. Due to massively parallel computations using irregular data structures by GCNs, traditional processors such as CPUs, GPUs, and TPUs exhibit significant inefficiency when performing GCN inferences. Even though researchers have proposed several GCN accelerators, the prior dataflow architectures struggle with inefficient data utilization due to the divergent and irregularly structured graph data. In order to overcome such performance hurdles, we propose a hybrid dataflow architecture for sparse-dense matrix multiplications (SpDeMMs), called HyMM. HyMM employs disparate dataflow architectures using different data formats to achieve more efficient data reuse across varying degree levels within graph structures, hence HyMM can reduce off-chip memory accesses significantly. We implement a cycle-accurate simulator to evaluate the performance of HyMM. Our evaluation results demonstrate that HyMM can achieve up to 4.78x performance uplift by reducing off-chip memory accesses by 91% compared to the conventional non-hybrid dataflow.
14:35 CEST TS11.8 BUDDY ECC: MAKING CACHE MOSTLY CLEAN IN CXL-BASED MEMORY SYSTEMS FOR ENHANCED ERROR CORRECTION AT LOW COST
Speaker:
Yongho Lee, Sungkyunkwan University, KR
Authors:
Yongho Lee, Junbum Park, Osang Kwon, Sungbin Jang and Seokin Hong, Sungkyunkwan University, KR
Abstract
As Compute Express Link (CXL) emerges as a key memory interconnect, interest in optimization opportunities and challenges has grown. However, due to the different characteristics of the CXL Memory Module (CMM) compared to traditional DRAM-based Dual In-line Memory Modules (DIMMs), existing optimizations may not be effectively applied. In this paper, we propose a Proactive Write-back Policy that leverages the full-duplex nature and features of the CMM to optimize bandwidth, enhance reliability, and reduce area overhead. First, the Proactive Write-back policy improves bandwidth efficiency by minimizing dirty cachelines in the last-level cache through dead block prediction, proactively identifying and writing back cachelines that are unlikely to be rewritten. Second, the Utilization-aware Policy dynamically monitors the internal bandwidth of the CMM, sending write-back requests only when the module is under a low load, thus preventing performance degradation during high traffic. Finally, the robust Buddy ECC scheme enhances data reliability by separating Error Detection Code (EDC) for clean cachelines and stronger Error Correction Code (ECC) for dirty cachelines. Buddy ECC improved bandwidth utilization by 46%, limited performance degradation to 0.33%, and kept the energy consumption increase under 1%.
14:40 CEST TS11.9 A PERFORMANCE ANALYSIS OF CHIPLET-BASED SYSTEMS
Speaker:
Neethu Bal Mallya, Department of Computer Science and Engineering, Chalmers University of Technology, Sweden, SE
Authors:
Neethu Bal Mallya, Panagiotis Strikos, Bhavishya Goel, Ahsen Ejaz and Ioannis Sourdis, Chalmers University of Technology, SE
Abstract
As the semiconductor industry struggles to keep Moore's law alive and integrate more functionality on a chip, multi-chiplet chips offer a lower cost alternative to large monolithic chips due to their higher yield. However, chiplet-based chips are naturally Non-Uniform Memory Access (NUMA) systems and therefore suffer from slow remote accesses. NUMA overheads are exacerbated by the limited throughput and higher latency of inter-chiplet communication. This paper offers a comprehensive analysis of chiplet-based systems with different design parameters measuring their performance overheads compared to traditional monolithic multicore designs and their scalability to system and chiplet size. Several design alternatives pertaining to the memory hierarchy, interconnects, and technology aspects are studied. Our analysis shows that although chiplet-based chips can cut (recurring engineering) costs to half, they may give away over a third of the monolithic performance. Part of this performance overhead can be regained with specific design choices.
14:45 CEST TS11.10 A HIGH-PERFORMANCE AND FLEXIBLE ACCELERATOR FOR DYNAMIC GRAPH CONVOLUTIONAL NETWORKS
Speaker:
Ke Wang, University of North Carolina at Charlotte, US
Authors:
Yingnan Zhao1, Ke Wang2 and Ahmed Louri1
1The George Washington University, US; 2University of North Carolina at Charlotte, US
Abstract
Dynamic Graph Convolutional Networks (DGCNs) have been applied to various dynamic graph-related applications, such as social networks, to achieve high inference accuracy. Typically, each DGCN layer consists of two distinct modules: a Graph Convolutional Network (GCN) module that captures spatial information, and a Recurrent Neural Network (RNN) module that extracts temporal information from input dynamic graphs. The different functionalities of these modules pose significant challenges for hardware platforms, particularly in achieving high-performance and energy-efficient inference processing. To this end, this paper introduces HiFlex, a high-performance and flexible accelerator designed for DGCN inference. At the architecture level, HiFlex implements multiple homogeneous processing elements (PEs) to perform main computations for GCN and RNN modules, along with a versatile interconnection fabric to optimize data communication and enhance on-chip data reuse efficiency. The flexible interconnection fabric can be dynamically configured to provide various on-chip topologies, supporting point-to-point and multicast communication patterns needed for GCN and RNN processing. At the algorithm level, HiFlex introduces a dynamic control policy that partitions, allocates, and configures hardware resources for distinct modules based on their computational requirements. Evaluation results using real-world dynamic graphs demonstrate that HiFlex achieves, on average, a 38% reduction in execution time and a 42% decrease in energy consumption for DGCN inference, compared to state-of-the-art approaches such as ES-DGCN, ReaDy, and RACE.
14:50 CEST TS11.11 AMPHI: PRACTICAL AND INTELLIGENT DATA PREFETCHING FOR THE FIRST-LEVEL CACHE
Speaker:
Zicong Wang, College of Computer Science and Technology, National University of Defense Technology, CN
Authors:
Xuan Tang, Zicong Wang, Shuiyi He, Dezun Dong and Xiangke Liao, National University of Defense Technology, CN
Abstract
Data prefetchers play a crucial role in alleviating the memory wall by predicting future memory accesses. First-level cache prefetchers can observe all memory instructions but often rely on simpler strategies due to limited resources. While emerging machine learning-based approaches cover more memory access patterns, they typically require higher computational and storage resources and are usually deployed in the last-level cache. Other intelligent solutions for the first-level cache show only modest performance gains. To address this, we propose Amphi, the first practical and intelligent data prefetcher specifically designed for the first-level cache. Applying a binarized temporal convolutional network, Amphi significantly reduces storage overhead while maintaining performance comparable to the SOTA intelligent prefetcher. With a storage overhead of only 3.4 KB, Amphi requires only one-eighth of Pythia's storage needs. Amphi paves the way for the broader adoption of intelligence-driven prefetching solutions.

TS12 Smart and Autonomous Systems for a Smart World

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS12.1 SACK: ENABLING ENVIRONMENTAL SITUATION-AWARE ACCESS CONTROL FOR AUTONOMOUS VEHICLES IN LINUX KERNEL
Speaker:
Boyan Chen, Peking University, CN
Authors:
Boyan Chen1, Qingni Shen1, Lei Xue2, Jiarui She1, Xiaolei Zhang1, Xiapu Luo3, Xin Zhang1, Wei Chen1 and Zhonghai Wu1
1Peking University, CN; 2Sun Yat-Sen University, CN; 3The Hong Kong Polytechnic University, HK
Abstract
Connected and autonomous vehicles (CAVs) operate in open and evolving environments, which require timely and adaptive permission restriction to address dynamic risks that arise from changes in environmental situations (hereinafter referred to as situations), such as emergency situations due to vehicle crashes. Enforcing situation-aware access control is an effective approach to support adaptive permission restriction. Current works mainly implement situation-aware access control in the permission framework and API monitoring in user space. They are vulnerable to being bypassed and are coarse-grained. Autonomous systems have widely adopted mandatory access control (MAC) to configure and enforce system-wide and fine-grained access control policies. However, the MAC supported by Linux security modules (LSM) relies on pre-defined security contexts (e.g., type) and relatively fixed permission transition conditions (e.g., exec syscall), which lacks consideration of environmental factors. To address these issues, we propose a Situation-aware Access Control framework in the Kernel (SACK), which enforces adaptive permission restriction based on environmental factors for CAVs. Incorporating environmental situations into the LSM framework is not straightforward. SACK introduces situation states as a new security context for abstracting environmental factors in the kernel. Subsequently, SACK utilizes a situation state machine to implement new adaptive permission transitions triggered by situation events. In addition, SACK provides a novel situation-aware policy language that links specific user space permissions to MAC rules while maintaining compatibility with other LSMs such as AppArmor. We develop two prototypes: an independent SACK with its own policies and a SACK-enhanced AppArmor that adaptively updates the corresponding policies of AppArmor. The experimental results demonstrate that SACK can efficiently enforce situation-adaptive permissions with negligible runtime overhead.
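As a highly simplified illustration of the situation state machine idea described above (a user-space toy, not the kernel LSM implementation in the paper), the sketch below maps situation events to states and states to permitted operations; the concrete states, events, and permission sets are made-up examples.

```python
# Situation-aware permission restriction sketch (user-space toy, not an LSM).
# States, events, and permission sets below are hypothetical examples; the
# paper implements this mechanism inside the Linux kernel as a security module.
TRANSITIONS = {
    ("normal", "crash_detected"): "emergency",
    ("normal", "geofence_exit"): "restricted",
    ("emergency", "crash_cleared"): "normal",
    ("restricted", "geofence_enter"): "normal",
}
PERMISSIONS = {
    "normal": {"read_sensors", "actuate", "upload_logs"},
    "restricted": {"read_sensors", "upload_logs"},
    "emergency": {"read_sensors"},   # only minimal access while in emergency
}

class SituationStateMachine:
    def __init__(self, state: str = "normal"):
        self.state = state

    def on_event(self, event: str) -> str:
        # Unknown (state, event) pairs leave the state unchanged.
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state

    def allowed(self, operation: str) -> bool:
        return operation in PERMISSIONS[self.state]

ssm = SituationStateMachine()
ssm.on_event("crash_detected")
assert not ssm.allowed("actuate") and ssm.allowed("read_sensors")
```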
14:05 CEST TS12.2 EXPLOITING SYSML V2 MODELING FOR AUTOMATIC SMART FACTORIES CONFIGURATION
Speaker:
Mario Libro, Università di Verona, IT
Authors:
Mario Libro1, Sebastiano Gaiardelli1, Marco Panato2, Stefano Spellini2, Michele Lora1 and Franco Fummi1
1Università di Verona, IT; 2Factoryal S.r.l., IT
Abstract
Smart factories are complex environments equipped with both production machinery and computing devices that collect, share, and analyze data. For this reason, the modeling of today's factories can no longer rely on traditional methods, and computer engineering tools, such as SysML, must be employed. At the same time, the current SysML v1.* standard does not provide the rigor required to model the complexity and the criticalities of a smart factory. Recently, SysML v2 has been proposed and is about to be released as the new version of the standard. Its release candidate shows that the new version aims to provide a more rigorous and complete modeling language, able to fulfill the requirements of the smart factory domain. In this paper, we explore the capabilities of the new SysML v2 standard by building a rigorous modeling strategy, able to capture the aspects of a smart factory related to the production process, the computation, and the communication. We apply the proposed strategy to model a fully-fledged smart factory, and we rely on the models to automatically configure the different pieces of equipment and software components in the factory.
14:10 CEST TS12.3 HIDP: HIERARCHICAL DNN PARTITIONING FOR DISTRIBUTED INFERENCE ON HETEROGENEOUS EDGE PLATFORMS
Speaker:
Zain Taufique, University of Turku, FI
Authors:
Zain Taufique1, Aman Vyas1, Antonio Miele2, Pasi Liljeberg1 and Anil Kanduri1
1University of Turku, FI; 2Politecnico di Milano, IT
Abstract
Edge inference techniques partition and distribute Deep Neural Network (DNN) inference tasks among multiple edge nodes for low latency inference, without considering the core-level heterogeneity of edge nodes. Further, default DNN inference frameworks also do not fully utilize the resources of heterogeneous edge nodes, resulting in higher inference latency. In this work, we propose a hierarchical DNN partitioning strategy (HiDP) for distributed inference on heterogeneous edge nodes. Our strategy hierarchically partitions DNN workloads at both global and local levels by considering the core-level heterogeneity of edge nodes. We evaluated our proposed HiDP strategy against relevant distributed inference techniques over widely used DNN models on commercial edge devices. On average our strategy achieved 38% lower latency, 46% lower energy, and 56% higher throughput in comparison with other relevant approaches.
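A bare-bones illustration of hierarchical partitioning in the spirit described above: first split the layer workload across edge nodes in proportion to node-level throughput, then split each node's share across its heterogeneous core clusters. The throughput figures and the proportional-split rule are assumptions for illustration, not the paper's policy.

```python
# Two-level proportional workload split (illustrative, not the HiDP policy).
# Capabilities are hypothetical relative throughput figures.
def proportional_split(total: float, capabilities: dict[str, float]) -> dict[str, float]:
    cap_sum = sum(capabilities.values())
    return {name: total * cap / cap_sum for name, cap in capabilities.items()}

nodes = {"node_A": 4.0, "node_B": 2.0, "node_C": 1.0}          # global level
cores_per_node = {
    "node_A": {"big_cores": 3.0, "little_cores": 1.0},          # local level
    "node_B": {"big_cores": 1.0, "little_cores": 1.0},
    "node_C": {"little_cores": 1.0},
}

workload_layers = 70.0  # e.g., DNN layers (or FLOPs) to distribute
global_share = proportional_split(workload_layers, nodes)
local_share = {n: proportional_split(share, cores_per_node[n])
               for n, share in global_share.items()}
print(global_share)
print(local_share)
```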
14:15 CEST TS12.4 COUPLING NEURAL NETWORKS AND PHYSICS EQUATIONS FOR LI-ION BATTERY STATE-OF-CHARGE PREDICTION
Speaker:
Giovanni Pollo, Politecnico di Torino, IT
Authors:
Giovanni Pollo1, Alessio Burrello2, Enrico Macii1, Massimo Poncino1, Sara Vinco1 and Daniele Jahier Pagliari1
1Politecnico di Torino, IT; 2Politecnico di Torino | Università di Bologna, IT
Abstract
Estimating the evolution of the battery's State of Charge (SoC) in response to its usage is critical for implementing effective power management policies and for ultimately improving the system's lifetime. Most existing estimation methods are either physics-based digital twins of the battery or data-driven models such as Neural Networks (NNs). In this work, we propose two new contributions in this domain. First, we introduce a novel NN architecture formed by two cascaded branches: one to predict the current SoC based on sensor readings, and one to estimate the SoC at a future time as a function of the load behavior. Second, we integrate battery dynamics equations into the training of our NN, merging the physics-based and data-driven approaches, to improve the models' generalization over variable prediction horizons. We validate our approach on two publicly accessible datasets, showing that our Physics-Informed Neural Networks (PINNs) outperform purely data-driven ones while also obtaining superior prediction accuracy with a smaller architecture with respect to the state-of-the-art.
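As a rough illustration of the physics-informed idea described above (not the authors' two-branch architecture), the sketch below combines a data-driven SoC regression loss with a Coulomb-counting residual. The toy MLP, the use of PyTorch, and the simple update SoC(t+dt) = SoC(t) - I*dt/Q are assumptions made for illustration only.

```python
# Minimal physics-informed loss sketch for SoC prediction (illustrative only).
# Assumptions: a toy MLP, PyTorch, and simple Coulomb-counting dynamics
#   soc(t + dt) ~= soc(t) - i * dt / q_nominal
# The paper integrates its own battery dynamics equations and network design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SocNet(nn.Module):
    def __init__(self, in_features: int = 3, hidden: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # SoC in [0, 1]
        )

    def forward(self, x):  # x = [voltage, current, temperature]
        return self.body(x)

def pinn_loss(model, x_t, soc_t, x_next, soc_next, current, dt, q_nominal, lam=0.1):
    """Data loss on measured SoC plus a physics residual on the SoC update."""
    pred_t = model(x_t)
    pred_next = model(x_next)
    data_loss = F.mse_loss(pred_t, soc_t) + F.mse_loss(pred_next, soc_next)
    # Physics residual: the predicted future SoC should follow Coulomb counting.
    physics_target = pred_t - current * dt / q_nominal
    physics_loss = F.mse_loss(pred_next, physics_target)
    return data_loss + lam * physics_loss

# Toy usage with random data (shapes only; not a real battery trace).
model = SocNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_t, x_next = torch.rand(64, 3), torch.rand(64, 3)
soc_t, soc_next = torch.rand(64, 1), torch.rand(64, 1)
current, dt, q = torch.rand(64, 1), 1.0, 3600.0  # amps, seconds, amp-seconds (toy)
loss = pinn_loss(model, x_t, soc_t, x_next, soc_next, current, dt, q)
loss.backward()
opt.step()
```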
14:20 CEST TS12.5 AUTONOMOUS UAV-ASSISTED IOT SYSTEMS WITH DEEP REINFORCEMENT LEARNING BASED DATA FERRY
Speaker:
Mason Conkel, The University of Texas at San Antonio, US
Authors:
Mason Conkel1, Wen Zhang2, Mimi Xie1, Yufang Jin1 and Chen Pan1
1The University of Texas at San Antonio, US; 2Wright State University, US
Abstract
Emerging unmanned aerial vehicle (UAV) technology offers reliable, flexible, and controllable techniques for transferring data collected by wireless internet of things (IoT) devices located in remote areas. However, deploying UAVs faces limitations in mission distance between recharges, especially when recharging occurs far from the monitoring area. To address these challenges, we propose smart charging stations installed within the monitoring area and equipped with energy-harvesting features and communication modules. These stations can replenish the UAV's energy and act as cluster heads by collecting information from IoT devices within their jurisdiction. This allows a UAV to operate continuously by downloading while charging and forwarding the data to the remote server during flight. Despite these improvements, the unpredictable nature of energy-harvesting devices and charging needs can lead to stale or obsolete information at cluster heads. The limited communication range may prevent the cluster heads from establishing connections with all nodes in their jurisdiction. To overcome these issues, we propose an age-of-information-aware data ferry algorithm using deep reinforcement learning to determine the UAV's flight path. The deep reinforcement learning agent, running on cluster heads, utilizes a global state gathered by the UAV to output the location of the next stop, which can be a cluster head or an IoT device. The experiments show that the algorithm can minimize the age of information without diminishing data collection.
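To make the age-of-information objective concrete, the following sketch maintains per-node AoI counters and greedily picks the next stop with the highest age; this is a simplified heuristic standing in for the paper's deep-reinforcement-learning policy, and all node names and numbers are assumptions.

```python
# Greedy age-of-information (AoI) next-stop selection (illustrative only).
# The paper trains a DRL agent for this decision; a simple heuristic stands in
# here so the objective is easy to see. All values are hypothetical.
ages = {"cluster_head_A": 12.0, "cluster_head_B": 3.0, "iot_node_7": 25.0}

def step(ages: dict[str, float], dt: float) -> str:
    """Advance every node's AoI by dt, visit the stalest node, reset its AoI."""
    for node in ages:
        ages[node] += dt
    next_stop = max(ages, key=ages.get)
    ages[next_stop] = 0.0   # data collected, information is fresh again
    return next_stop

for _ in range(3):
    print(step(ages, dt=1.0), ages)
```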
14:25 CEST TS12.6 AERODIFFUSION: COMPLEX AERIAL IMAGE SYNTHESIS WITH DYNAMIC TEXT DESCRIPTIONS AND FEATURE-AUGMENTED DIFFUSION MODELS
Speaker:
Douglas Townsell, Wright State University, US
Authors:
Douglas Townsell1, Mimi Xie2, Bin Wang1, Fathi Amsaad1, Varshitha Thanam3 and Wen Zhang1
1Wright State University, US; 2The University of Texas at San Antonio, US; 3Wright State University, US
Abstract
Aerial imagery provides crucial insights for various fields, including remote monitoring, environmental assessment, and autonomous navigation. However, the availability of aerial image datasets is limited due to privacy concerns and imbalanced data distribution, impeding the development of robust deep learning models. Recent advancements in text-guided image synthesis offer a promising approach to enrich and diversify these datasets. Despite progress, existing generative models face challenges in synthesizing realistic aerial images due to the lack of paired text-aerial datasets, the complexity of densely packed objects, and the limitations of modeling object relationships. In this paper, we introduce AeroDiffusion, a novel framework designed to overcome these challenges by leveraging large language models (LLMs) for keypoint-aware text description generation and a feature-augmented diffusion process for realistic image synthesis. Our approach integrates region-level feature extraction to preserve small objects and multi-modal feature alignment to improve textual descriptions of complex aerial scenes. AeroDiffusion is the first to extend deep generative models for high-resolution, text-guided aerial image generation, including the creation of images from novel viewpoints. We contribute a new paired text-aerial image dataset and demonstrate the effectiveness of our model, achieving an FID score of 78.15 across five benchmarks, significantly outperforming state-of-the-art models such as DDPM (217.95), Stable Diffusion (119.13), and ARLDM (111.59).
14:30 CEST TS12.7 POWER- AND DEADLINE-AWARE DYNAMIC INFERENCE ON INTERMITTENT COMPUTING SYSTEMS
Speaker:
Hengrui Zhao, University of Southampton, GB
Authors:
Hengrui Zhao, Lei Xun, Jagmohan Chauhan and Geoff Merrett, University of Southampton, GB
Abstract
In energy-harvesting intermittent computing systems, balancing power constraints with the need for timely and accurate inference remains a critical challenge. Existing methods often sacrifice significant accuracy or fail to adapt effectively to fluctuating power conditions. This paper presents DualAdaptNet, a power- and deadline-aware neural network architecture that dynamically adapts both its width and depth to ensure reliable inference under variable power conditions. Additionally, a runtime scheduling method is introduced to select an appropriate sub-network configuration based on real-time energy-harvesting conditions and system deadlines. Experimental results on the MNIST dataset demonstrate that our approach completes up to 7.0% more inference tasks within a specified deadline while also improving average accuracy by 15.4% compared to the state-of-the-art.
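A minimal sketch of the kind of runtime selection described above: given a table of sub-network configurations with profiled latency, energy, and accuracy (all numbers hypothetical), pick the most accurate configuration that fits the current energy budget and deadline. The actual scheduler and profiling data in the paper differ.

```python
# Deadline- and power-aware sub-network selection (illustrative sketch).
# The configurations and numbers below are hypothetical, not from the paper.
from dataclasses import dataclass

@dataclass
class SubNet:
    name: str
    latency_ms: float   # profiled inference time
    energy_uj: float    # profiled energy per inference
    accuracy: float     # validation accuracy

CONFIGS = [
    SubNet("full-width/full-depth", 12.0, 900.0, 0.97),
    SubNet("half-width/full-depth", 7.0, 520.0, 0.95),
    SubNet("half-width/half-depth", 3.5, 260.0, 0.91),
    SubNet("quarter-width/half-depth", 2.0, 140.0, 0.86),
]

def select_subnet(deadline_ms: float, energy_budget_uj: float) -> SubNet | None:
    """Return the most accurate configuration that meets both constraints."""
    feasible = [c for c in CONFIGS
                if c.latency_ms <= deadline_ms and c.energy_uj <= energy_budget_uj]
    return max(feasible, key=lambda c: c.accuracy) if feasible else None

# Example: plenty of stored energy but a tight deadline.
print(select_subnet(deadline_ms=4.0, energy_budget_uj=600.0))
```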
14:35 CEST TS12.8 DCHA: DISTRIBUTED-CENTRALIZED HETEROGENEOUS ARCHITECTURE ENABLES EFFICIENT MULTI-TASK PROCESSING FOR SMART SENSING
Speaker:
Cheng Qu, Beijing University of Posts and Telecommunications, CN
Authors:
Erxiang Ren1, Cheng Qu2, Li Luo1, Yonghua Li2, Zheyu Liu3, Xinghua Yang4, Qi Wei5 and Fei Qiao5
1Beijing Jiaotong University, CN; 2Beijing University of Posts and Telecommunications, CN; 3MakeSens AI, CN; 4Beijing Forestry University, CN; 5Tsinghua University, CN
Abstract
The rapid development of artificial intelligence (AI) has accelerated the progression of IoT technology into the smart era. Integrating AI processing capabilities into IoT devices to create smart sensing systems holds significant promise. In this work, we propose a distributed-centralized heterogeneous architecture that enables efficient multi-task processing for smart sensing. This architecture improves the operational efficiency of sensing systems and enhances the deployment scalability through collaborative computing across end, edge, and center nodes. Specifically, we partition the network in traditional centralized sensing systems into several parts and perform algorithm-hardware co-design for each part on its respective deployment platform. We developed a sample design to validate the proposed architecture. By implementing a lightweight image encoder, we achieved an 88x reduction in encoder parameters and up to 9873x energy gain, facilitating deployment on resource-constrained devices. Experimental results demonstrate that the proposed architecture effectively reduces overall energy consumption by 0.0573x to 0.0889x, while maintaining robust multi-task inference capabilities. Moreover, energy consumption reductions of 2.88x to 3.22x on edge nodes and 6311.56x to 10037.23x on end nodes were observed.
14:40 CEST TS12.9 FAIRXBAR: IMPROVING THE FAIRNESS OF DEEP NEURAL NETWORKS WITH NON-IDEAL IN-MEMORY COMPUTING HARDWARE
Speaker:
Cheng Wang, Iowa State University of Science and Technology, US
Authors:
Sohan Salahuddin Mugdho1, Yuanbo Guo2, Ethan Rogers1, Weiwei Zhao1, Yiyu Shi2 and Cheng Wang1
1Iowa State University of Science and Technology, US; 2University of Notre Dame, US
Abstract
While artificial intelligence (AI) based on deep neural networks (DNN) has achieved near-human performance in various cognitive tasks, such data-driven models are known to exhibit implicit bias against specific subgroups, leading to fairness issues. Most existing methods for improving model fairness only consider software-based optimizations, while the impact of hardware is largely unexplored. In this work, we investigate the impact of underlying hardware technology on AI fairness as we deploy DNN-based medical diagnosis algorithms onto in-memory computing hardware accelerators. Based on our newly developed framework that characterizes the importance of DNN weight parameters to fairness, we demonstrate that device variability-induced non-idealities such as stuck-at faults and noises due to variation can be exploited to deliver improved fairness (up to 32% improvement) with significantly reduced trade-off (less than 1% loss) of the overall accuracy. We additionally develop a hardware non-idealities-aware training methodology that further mitigates the bias between unprivileged and privileged demographic groups in our experiments on skin lesion diagnosis datasets. Our work suggests exciting opportunities for leveraging the hardware attributes in a cross-layer co-design to enable equitable and fair AI.
14:45 CEST TS12.10 HUMAN-CENTERED DIGITAL TWIN FOR INDUSTRY 5.0
Speaker:
Francesco Biondani, Università di Verona, IT
Authors:
Francesco Biondani1, Luigi Capogrosso1, Nicola Dall'Ora1, Enrico Fraccaroli2, Marco Cristani1 and Franco Fummi1
1Università di Verona, IT; 2University of North Carolina at Chapel Hill, US
Abstract
Moving beyond the automation-driven paradigm of Industry 4.0, Industry 5.0 emphasizes human-centric industrial systems where human creativity and instincts complement precise and advanced machines. With this new paradigm, there is a growing need for resource-efficient and user-preferred manufacturing solutions that integrate humans into industrial processes. Unfortunately, methodologies for incorporating human elements into industrial processes remain underdeveloped. In this work, we present the first pipeline for the creation of a human-centered Digital Twin (DT), leveraging Unreal Engine's MetaHuman technology to track worker alertness in real-time. Our findings demonstrate the potential of integrating Artificial Intelligence (AI) and human-centered design within Industry 5.0 to enhance both worker safety and industrial efficiency.
14:46 CEST TS12.11 ENERGY-AWARE ERROR CORRECTION METHOD FOR INDOOR POSITIONING AND TRACKING
Speaker:
Donkyu Baek, Chungbuk National University, KR
Authors:
Donguk Kim1, Yukai Chen2, Donkyu Baek1, Enrico Macii3 and Massimo Poncino3
1Chungbuk National University, KR; 2IMEC, BE; 3Politecnico di Torino, IT
Abstract
Indoor positioning is crucial for the effective use of drones in smart environments, enabling precise navigation and control in complex indoor spaces where GPS signals are weak or unavailable and wireless communication-based systems must be used. In order to improve positioning accuracy, various distance measurement techniques and related error correction methods have been proposed in the literature. However, these methods are mostly focused on accuracy and often require a significant amount of computational resources, which is quite inefficient when deployed on battery-operated devices like small robots or drones because of their limited battery capacity. Moreover, conventional error correction methods are not very effective for tracking moving objects. In this paper, we first analyze the trade-off between energy consumption and accuracy of error correction and identify the most energy-efficient error correction method. Based on this analysis in the accuracy/energy space, we introduce a new energy-efficient error correction method that is especially targeted at tracking a moving object. We validated our solution by implementing an Ultra-Wideband-based indoor positioning system and demonstrated that the proposed method improves positioning accuracy by 15% and reduces energy consumption by 33% compared to the state-of-the-art method.
14:47 CEST TS12.12 DECENTRALIZING IOT DATA PROCESSING: THE RISE OF BLOCKCHAIN-BASED SOLUTIONS
Speaker:
Daniela De Venuto, Polytechnic University of Bari, IT
Authors:
Giuseppe Spadavecchia1, Marco Fiore2, Marina Mongiello2 and Daniela De Venuto2
1Private, IT; 2Polytechnic University of Bari, IT
Abstract
The rise of the Internet of Things has introduced new challenges related to data security and transparency, especially in industries like agri-food where traceability is critical. Traditional cloud-based solutions, while scalable, pose security and privacy risks. This paper proposes a decentralized architecture using Blockchain technology to address these challenges. We deploy IoT sensors connected to a Raspberry Pi for edge processing and utilize Hyperledger Fabric, a private Blockchain, to manage and store data securely. Two approaches were evaluated: computation of a Discomfort Index on the Raspberry Pi (edge processing) versus performing the same computation on-chain using smart contracts. Performance metrics, including latency, throughput, and error rate, were measured using Hyperledger Caliper. The results show that edge processing offers superior performance in terms of latency and throughput, while Blockchain-based computation ensures greater transparency and trust. This study highlights the potential of Blockchain as a viable alternative to centralized cloud systems in IoT environments and suggests future research in scalability, hybrid architectures, and energy efficiency.
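For context, a common way to compute a discomfort index from temperature and relative humidity is Thom's formula; the sketch below uses it as a stand-in for the edge-side computation described above. The choice of Thom's index is an assumption, since the paper does not state which variant it computes.

```python
# Edge-side discomfort-index computation, as it might run on the Raspberry Pi.
# Assumption: Thom's discomfort index; the paper's exact formula is not given.
def discomfort_index(temp_c: float, rel_humidity: float) -> float:
    """Thom's DI from air temperature (deg C) and relative humidity (%)."""
    return temp_c - 0.55 * (1.0 - 0.01 * rel_humidity) * (temp_c - 14.5)

# Example: 30 degC at 70% RH gives DI of roughly 27.4 (uncomfortable range).
if __name__ == "__main__":
    print(round(discomfort_index(30.0, 70.0), 1))
```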
14:48 CEST TS12.13 ENABLING A PORTABLE BRAIN COMPUTER INTERFACE FOR REHABILITATION OF SPINAL CORD INJURIES
Speaker:
Adrian Evans, CEA, FR
Authors:
Adrian Evans1, Victor Roux-Sibillon2, Joe Saad2, Ivan Miro-Panades2, Tetiana Aksenova3 and Lorena Anghel4
1CEA, FR; 2CEA-List, FR; 3CEA-Leti, FR; 4Grenoble-Alpes University, Grenoble, France, FR
Abstract
In clinical trials, brain signal decoders combined with spinal stimulation have been shown to be a promising means to restore mobility to paraplegic and tetraplegic patients. To make this technology available for home use, the complex brain signal decoding must be performed using a low-power, portable, battery-operated system. This case study shows how the decoding algorithm for a Brain-Computer Interface (BCI) system was ported to an embedded platform, resulting in an over 25× power reduction compared to the previous implementation, while respecting real-time and accuracy constraints.

W05 OSSMPIC - Open Source Solutions for Massively Parallel Integrated Circuits

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 18:00 CEST


W06 Cross-stack Explorations of Ferroelectric-based Logic and Memory Solutions for At-Scale Compute Workloads

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 18:00 CEST


W08 ASD Workshop “How to supervise autonomy?”

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 18:00 CEST

Autonomous systems are on their way from an exotic system species to a mainstream technology, where they even reach safety-critical and high-assurance applications. Yet, efficient design concepts providing the required behavioral guarantees while keeping the benefits of autonomous intelligence are still an open topic, in theory and even more so in engineering practice. Some approaches rely on centralized guidance via infrastructure, while others extend individual component capabilities by protective, often model-based, functions. Another differentiation is the use of human support in unclear situations, such as in level 4 vehicle automation, vs. independent management with function degradation (e.g., safety layers). The workshop plans to provide examples from very different areas, such as road traffic, UAVs, human-assistive robotics, and facility management. The design concepts have a high societal and economic relevance, including legal aspects such as certification and liability.


FS02 Focus Session - AI-Driven Design Evolution: Benchmarking and Infrastructure for the Next Era of Semiconductors and Photonics

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 16:30 CEST - 18:00 CEST

Session chair:
Anthony Agnesina, NVIDIA Corp., US

Session co-chair:
Hao Geng, Shanghai Tech University, CN

Organisers:
Haoyu Yang, NVIDIA Corp., US
Yuzhe Ma, Hong Kong University of Science and Technology (GZ), CN

"As AI and machine learning models become increasingly integrated into semiconductor and photonic design workflows, the need for rigorous benchmarking, robust datasets, and scalable infrastructures is paramount. This special session presents pioneering research on evaluating AI capabilities across digital hardware, formal verification, and photonic device design, with a strong focus on the importance of benchmark frameworks and dataset development. The session will feature four key talks: 1) ChipVQA is a benchmark designed to evaluate visual language models (VLMs) in chip design, requiring a visual understanding of diagrams and schematics across five disciplines. Current models, including GPT-4o, struggle with domain-specific tasks, while a novel agent-based approach shows potential for improved performance. 2) FVEval is a comprehensive benchmark designed to evaluate large language models (LLMs) in formal verification tasks for digital chip design. It assesses LLMs' abilities to generate SystemVerilog assertions and reason about design RTL. The benchmark includes both expert-written and synthetic examples, offering insights into current LLM capabilities and potential for improving formal verification productivity. 3) MAPS introduces an open-source infrastructure to standardize AI-based solvers for photonic device simulation and inverse design. It provides a rich dataset, a neural operator model zoo for training, and a scalable framework for benchmarking AI-based photonic simulators. MAPS aims to accelerate innovation in photonic hardware by bridging the gap between AI-driven physics simulations and photonic design optimization. 4) PICEval introduces a benchmark to evaluate large language models (LLMs) for automating the design of photonic integrated circuits (PICs). The benchmark spans device- to circuit-level designs and assesses the functionality and fidelity of LLM-generated netlists by comparing them to expert-written solutions. It highlights the challenges and potential of LLMs in automating PIC design and identifies areas for further research to optimize their application. Together, these talks underscore the crucial role of benchmarks, datasets, and scalable infrastructure in advancing AI for chip and photonic design, shaping the future of automated and intelligent design workflows."

Time Label Presentation Title
Authors
16:30 CEST FS02.1 CHIPVQA: BENCHMARKING VISUAL LANGUAGE MODELS FOR CHIP DESIGN
Speaker:
Haoyu Yang, NVIDIA Corp., US
Authors:
Haoyu Yang, Qijing Huang, Nathaniel Pinckney, Walker Turner, Wenfei Zhou, Yanqing Zhang, Chia-Tung Ho, Chen-Chia Chang and Haoxing Ren, NVIDIA Corp., US
Abstract
Large-language models (LLMs) have shown great potential in assisting chip design and analysis, with recent research focusing primarily on text-based tasks such as general QA, debugging, and design tool scripting. However, the chip design and implementation workflow often requires a visual understanding of diagrams, flowcharts, graphs, schematics, waveforms, and more, necessitating the development of multi-modality foundation models. To address this gap, we propose ChipVQA, a benchmark designed to evaluate the capability of visual language models (VLMs) for chip design. ChipVQA comprises 142 carefully crafted and collected VQA questions spanning five chip design disciplines: Digital Design, Analog Design, Architecture, Physical Design, and Semiconductor Manufacturing. Unlike existing VQA benchmarks, ChipVQA questions are meticulously created by chip design experts and require in-depth domain knowledge and reasoning to solve. Our comprehensive evaluations on both open-source and proprietary multi-modal models reveal significant challenges posed by the benchmark suite, with existing VLMs struggling to meet the demands of chip design knowledge and reasoning. Notably, GPT-4o achieves only a 44% correctness rate. Additionally, we conducted a preliminary study on an alternative VLM inference methodology using an agent, which showed improved performance in certain categories without additional training, highlighting the potential of leveraging LLM agents as an alternative approach for VLM deployment in chip design.
16:53 CEST FS02.2 FVEVAL: UNDERSTANDING LANGUAGE MODEL CAPABILITIES IN FORMAL VERIFICATION OF DIGITAL HARDWARE
Speaker:
Minwoo Kang, University of California, Berkeley, US
Authors:
Minwoo Kang1, Mingjie Liu2, Ghaith Bany Hamad2, Syed Suhaib2 and Haoxing Ren2
1University of California, Berkeley, US; 2NVIDIA Corp., US
Abstract
The remarkable reasoning and code generation capabilities of large language models (LLMs) have spurred significant interest in applying LLMs to enable task automation in digital chip design. In particular, recent work has investigated early ideas of applying these models to formal verification (FV), an approach to verifying hardware implementations that can provide strong guarantees of confidence but demands significant amounts of human effort. While the value of LLM-driven automation is evident, our understanding of model performance, however, has been hindered by the lack of holistic evaluation.  In response, we present FVEval, the first comprehensive benchmark and evaluation framework for characterizing LLM performance in tasks pertaining to FV.  The benchmark consists of three sub-tasks that measure LLM capabilities at different levels---from the generation of SystemVerilog assertions (SVAs) given natural language descriptions to reasoning about the design RTL and suggesting assertions directly without ad
17:15 CEST FS02.3 MAPS: MULTI-FIDELITY AI-AUGMENTED PHOTONIC SIMULATION AND INVERSE DESIGN INFRASTRUCTURE
Speaker:
Haoyu Yang, NVIDIA Corp., US
Authors:
Pingchuan Ma1, Zhengqi Gao2, Meng Zhang3, Haoyu Yang4, Haoxing Ren4, Rena Huang3, Duane Boning2 and Jiaqi Gu1
1Arizona State University, US; 2Massachusetts Institute of Technology, US; 3Rensselaer Polytechnic Institute, US; 4NVIDIA Corp., US
Abstract
"Inverse design has become a powerful approach in photonic device optimization, enabling access to high-dimensional, non-intuitive design spaces that lead to ultra-compact devices with superior performance, ultimately advancing the development of high-density photonic integrated circuits (PICs). The adjoint method plays a key role in this process by efficiently computing both the figure of merit (FoM) and its analytical gradient with only two simulations, enabling gradient-based device topology optimization. However, a significant computational bottleneck remains, i.e., the reliance on solving partial differential equations (PDEs) or eigenvalue problems within simulation-in-the-loop optimization frameworks, which hinders scalability. Recent advancements in AI-based solvers offer a promising solution by accelerating the solving of these PDEs and eigenvalue problems, enabling faster and more scalable inverse design processes. Despite these advancements, a major challenge persists—the absence of an open-source, standardized, widely available infrastructure and dataset for training and benchmarking AI-based PDE solvers tailored to photonic hardware. In this work, we introduce MAPS (Multi-Fidelity AI-Augmented Photonic Simulation and Inverse Design Benchmarking Infrastructure) to fill this gap. MAPS features: 1. MAP-Data: A photonic device dataset that covers a broad design space of representative device types, capturing both high- and low-performance designs. The dataset integrates multi-modal inputs (structure, light source, etc.) and physically significant evaluation metrics (FoMs and light fields, etc.), offering a rich data source for AI-based photonic simulation research. 2. MAPS-Train: A standardized AI-for-photonics neural operator model zoo and training framework, featuring extensible configurations and seamless integration with MAPS-Data pipelines, facilitating fair comparisons and standardized benchmarking of AI-based, physics-inspired photonic simulators. 3. MAPS-InvDes: An advanced adjoint method-based inverse design infrastructure that abstracts complex physical details, making it accessible to both computer-aided design (CAD) and machine learning (ML) communities. It integrates seamlessly with pre-trained AI-based PDE solvers and incorporates customized fabrication variation models (e.g., differentiable lithography and etching) to validate practical applicability in real-world inverse design tasks. This infrastructure MAPS bridges the gap between AI-for-physics and photonic device design by providing a standardized, open-source platform for developing and benchmarking AI-based solvers, ultimately accelerating innovation in both photonic hardware optimization and scientific ML."
17:38 CEST FS02.4 PICBENCH: BENCHMARKING LLMS FOR PHOTONIC INTEGRATED CIRCUITS DESIGN
Speaker:
Yuchao Wu, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Yuchao Wu1, Xiaofei Yu1, Hao Chen1, Yang Luo1, Yeyu Tong2 and Yuzhe Ma1
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
While large language models (LLMs) have shown remarkable potential in automating various tasks in digital chip design, the field of Photonic Integrated Circuits (PICs)—a promising solution to advanced chip designs—remains relatively unexplored in this context. The design of PICs is time-consuming and prone to errors due to the extensive and repetitive nature of code involved in photonic chip design. In this paper, we introduce PICBench, the first benchmarking and evaluation framework specifically designed to automate PIC design generation using LLMs, where the generated output takes the form of a netlist. Our benchmark consists of dozens of meticulously crafted PIC design problems, spanning from fundamental device designs to more complex circuit-level designs. It automatically evaluates both the syntax and functionality of generated PIC designs by comparing simulation outputs with expert-written solutions, leveraging an open-source simulator. We evaluate a range of existing LLMs, while also conducting comparative tests on various prompt engineering techniques to enhance LLM performance in automated PIC design. The results reveal the challenges and potential of LLMs in the PIC design domain, offering insights into the key areas that require further research and development to optimize automation in this field.

TS13 Embedded, Real-Time and Dependable Systems

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS13.1 HARDWARE-ASSISTED RANSOMWARE DETECTION USING AUTOMATED MACHINE LEARNING
Speaker:
Zhixin Pan, Florida State University, US
Authors:
Zhixin Pan1 and Ziyu Shu2
1Florida State University, US; 2Washington University in St. Louis, US
Abstract
Ransomware has emerged as a severe privacy threat, leading to significant financial and data losses worldwide. Traditional detection methods, including static signature-based detection and dynamic behavior-based analysis, have shown limitations in effectively identifying and mitigating ever-evolving ransomware attacks. In this paper, we present a machine learning-based framework with hardware-level microprocessor activity monitoring to enhance detection performance. Specifically, the proposed method incorporates adversarial training to address the weaknesses of conventional static analysis against obfuscation, along with hardware-assisted behavior monitoring to reduce latency, achieving effective and real-time ransomware detection. The proposed method employs a Neural Architecture Search (NAS) algorithm to automate the selection of optimal machine learning models, significantly boosting generalizability. Experimental results demonstrate that our proposed method improves detection accuracy and reduces detection latency compared to existing approaches, while also maintaining high generalizability across diverse ransomware types.
16:35 CEST TS13.2 RICH: HETEROGENEOUS COMPUTING FOR REAL-TIME INTELLIGENT CONTROL SYSTEMS
Speaker:
Jintao Chen, Shanghai Jiao Tong University, CN
Authors:
Jintao Chen, Yuankai Xu, Yinchen Ni, An Zou and Yehan Ma, Shanghai Jiao Tong University, CN
Abstract
Over the past years, intelligent control tasks, such as deep neural networks (DNNs), have demonstrated significant potential in control systems. However, deploying intelligent control policies on heterogeneous computing platforms presents open challenges. These challenges extend beyond the apparent conflict between intensive computation and timing constraints and further encompass the interactions between task executions and complicated control performance. To address these challenges, this paper introduces RICH, a general and end-to-end approach to facilitate intelligent control tasks on heterogeneous computing architectures. RICH incorporates both offline Control-Oriented Computation and Resource Mapping (CCRM) and runtime Most Remaining Accelerator Segment Number First Scheduling (MRAF). Given the control tasks, CCRM starts by balancing the computation workloads and processor resources with the goal of optimizing overall control performance. Subsequently, MRAF employs segment-level real-time scheduling to ensure the timely execution of tasks. Extensive experiments on robotic arms (via a hardware-in-the-loop simulator) demonstrate that RICH works as a general, end-to-end approach. These experiments reveal significant improvements in control performance, with enhancements of 50.7% observed for intelligent control applications deployed on heterogeneous computing platforms.
16:40 CEST TS13.3 RT-VIRTIO: TOWARDS THE REAL-TIME PERFORMANCE OF VIRTIO IN A TWO-TIER COMPUTING ARCHITECTURE
Speaker:
Siwei Ye, Shanghai Jiao Tong University, CN
Authors:
Siwei Ye1, Minqing Sun1, Huifeng Zhu2, Yier Jin3 and An Zou1
1Shanghai Jiao Tong University, CN; 2Washington University in St. Louis, US; 3University of Science and Technology of China, CN
Abstract
With the popularity of virtualization technology, ensuring reliable I/O operations with timing constraints in virtual environments becomes increasingly critical. Timing-predictable virtual I/O enhances the responsiveness and efficiency of virtualized systems, facilitating their seamless integration into time-critical applications such as industrial automation and robotics. Its significance lies in meeting rigorous performance standards, minimizing latency, and consistently delivering predictable I/O performance. As a result, virtual machines can effectively support mission-critical and time-sensitive workloads. However, due to the complicated system architecture, I/O operations in a virtualized environment face competition both from other I/O operations within the same virtual machine and from the I/O of other virtual machines targeting the same host machine. This study presents RT-VirtIO, a practical approach to providing predictable, real-time I/O operations. RT-VirtIO addresses the challenges associated with lengthy data paths and complex resource management. Through early-stage characterization, this study identifies key factors contributing to poor I/O real-time performance and then builds an analytical model and a learning-based data-driven model to predict the tail I/O latency. Leveraging these two models, RT-VirtIO effectively captures these dynamics, enabling the development of a general and applicable optimization framework. Experimental results demonstrate that RT-VirtIO significantly improves real-time performance in virtual environments (by 20.07%–30.90%) without necessitating hardware modifications, and exhibits promising applicability across a broader range of scenarios.
16:45 CEST TS13.4 ENABLING SECURITY ON THE EDGE: A CHERI COMPARTMENTALIZED NETWORK STACK
Speaker:
Donato Ferraro, University of Modena and Reggio Emilia, Minerva Systems, IT
Authors:
Donato Ferraro1, Andrea Bastoni2, Alexander Zuepke3 and Andrea Marongiu4
1Minerva Systems SRL, University of Modena and Reggio Emilia, IT; 2TUM, Minerva Systems, DE; 3TU Munich, DE; 4Università di Modena e Reggio Emilia, IT
Abstract
The widespread deployment of embedded systems in critical infrastructures, interconnected edge devices like autonomous drones, and smart industrial systems requires robust security measures. Compromised systems increase the risks of operational failures, data breaches, and---in safety-critical environments---potential physical harm to people. Despite these risks, current security measures are often insufficient to fully address the attack surfaces of embedded devices. CHERI provides strong security from the hardware level by enabling fine-grained compartmentalization and memory protection, which can reduce the attack surface and improve the reliability of such devices. In this work, we explore the potential of CHERI to compartmentalize one of the most critical and targeted components of interconnected systems: their network stack. Our case study examines the trade-offs of isolating applications, TCP/IP libraries, and network drivers on a CheriBSD system deployed on the Arm Morello platform. Our results suggest that CHERI has the potential to enhance security while maintaining performance in embedded-like environments.
16:50 CEST TS13.5 TOWARDS RELIABLE SYSTEMS: A SCALABLE APPROACH TO AXI4 TRANSACTION MONITORING
Speaker:
Chaoqun Liang, Università di Bologna, IT
Authors:
Chaoqun Liang1, Thomas Benz2, Alessandro Ottaviano2, Angelo Garofalo1, Luca Benini1 and Davide Rossi1
1Università di Bologna, IT; 2ETH Zurich, CH
Abstract
In safety-critical SoC applications such as automotive and aerospace, reliable transaction monitoring is crucial for maintaining system integrity. This paper introduces a drop-in Transaction Monitoring Unit (TMU) for AXI4 subordinate endpoints that detects transaction failures, including protocol violations and timeouts, and triggers recovery by resetting the affected subordinates. Two TMU variants address different constraints: a Tiny-Counter solution for tightly area-constrained systems and a Full-Counter solution for critical subordinates in mixed-criticality SoCs. The Tiny-Counter employs a single counter per outstanding transaction, while the Full-Counter uses multiple counters to track distinct transaction stages, offering finer-grained monitoring and reducing detection latencies by up to hundreds of cycles at roughly 2.5× the area cost. The Full-Counter also provides detailed error logs for performance and bottleneck analysis. Evaluations at both IP and system levels confirm the TMU's effectiveness and low overhead. In GF12 technology, monitoring 16–32 outstanding transactions occupies 1330–2616 µm² for the Tiny-Counter and 3452–6787 µm² for the Full-Counter; moderate prescaler steps reduce these figures by 18–39% and 19–32%, respectively, with no loss of functionality. Results from a full-system integration demonstrate the TMU's robust and precise monitoring capabilities in safety-critical SoC environments.
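In spirit, the Tiny-Counter variant amounts to one countdown per outstanding transaction. The behavioral sketch below shows that mechanism in Python pseudocode form; the real TMU is RTL and also checks protocol violations, which is omitted here, so the class and its interface are assumptions for illustration.

```python
# Behavioral sketch of per-transaction timeout monitoring (Tiny-Counter spirit).
# The real TMU is hardware (RTL) and also flags protocol violations; this toy
# model only tracks timeouts for outstanding transactions.
class TimeoutMonitor:
    def __init__(self, timeout_cycles: int):
        self.timeout = timeout_cycles
        self.pending: dict[int, int] = {}   # transaction id -> remaining cycles

    def issue(self, txn_id: int):
        self.pending[txn_id] = self.timeout

    def complete(self, txn_id: int):
        self.pending.pop(txn_id, None)

    def tick(self) -> list[int]:
        """Advance one cycle; return ids whose subordinate should be reset."""
        expired = []
        for txn_id in list(self.pending):
            self.pending[txn_id] -= 1
            if self.pending[txn_id] <= 0:
                expired.append(txn_id)
                del self.pending[txn_id]
        return expired

mon = TimeoutMonitor(timeout_cycles=3)
mon.issue(0); mon.issue(1); mon.complete(1)
print([mon.tick() for _ in range(4)])   # transaction 0 expires on the third tick
```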
16:55 CEST TS13.6 EXACT SCHEDULABILITY ANALYSIS FOR LIMITED-PREEMPTIVE PARALLEL APPLICATIONS USING TIMED AUTOMATA IN UPPAAL
Speaker:
Jonas Hansen, Aalborg Universitet, DK
Authors:
Jonas Hansen1, Srinidhi Srinivasan2, Geoffrey Nelissen3 and Kim Larsen1
1Aalborg Universitet, DK; 2Technische Universiteit Eindhoven (TU/e), NL; 3Eindhoven University of Technology, NL
Abstract
We study the problem of verifying schedulability and ascertaining response time bounds of limited-preemptive parallel applications with uncertainty, scheduled on multi-core platforms. While sufficient techniques exist for analysing schedulability and response time of parallel applications under fixed-priority scheduling, their accuracy remains uncertain due to the lack of a scalable and exact analysis that can serve as a ground-truth to measure the pessimism of existing sufficient analyses. In this paper, we address this gap using formal methods. We use Timed Automata and the powerful UPPAAL verification engine to develop a generic approach to model parallel applications and provide a scalable and exact schedulability and response time analysis. This work establishes a benchmark for evaluating the accuracy of both existing and future sufficient analysis techniques. Furthermore, our solution is easily extendable to more complex task models thanks to its flexible model architecture.
17:00 CEST TS13.7 MONOMORPHISM-BASED CGRA MAPPING VIA SPACE AND TIME DECOUPLING
Speaker:
Cristian Tirelli, Università della Svizzera italiana, CH
Authors:
Cristian Tirelli, Rodrigo Otoni and Laura Pozzi, Università della Svizzera italiana, CH
Abstract
Coarse-Grain Reconfigurable Arrays (CGRAs) provide flexibility and energy efficiency in accelerating compute-intensive loops. Existing compilation techniques often struggle with scalability, unable to map code onto large CGRAs. To address this, we propose a novel approach to the mapping problem where the time and space dimensions are decoupled and explored separately. We leverage an SMT formulation to traverse the time dimension first, and then perform a monomorphism-based search to find a valid spatial solution. Experimental results show that our approach achieves the same mapping quality of state-of-the-art techniques while significantly reducing compilation time, with this reduction being particularly tangible when compiling for large CGRAs. We achieve approximately 10^5x average compilation speedup for the benchmarks evaluated on a 20x20 CGRA.
17:05 CEST TS13.8 ATTENTIONLIB: A SCALABLE OPTIMIZATION FRAMEWORK FOR AUTOMATED ATTENTION ACCELERATION ON FPGA
Speaker:
Zhenyu Liu, Fudan University, CN
Authors:
Zhenyu Liu, Xilang Zhou, Faxian Sun, Jianli Chen, Jun Yu and Kun Wang, Fudan University, CN
Abstract
The self-attention mechanism is a fundamental component within transformer-based models. Nowadays, as the length of sequences processed by large language models (LLMs) continues to increase, the attention mechanism has gradually become a bottleneck in model inference. The LLM inference process can be separated into two phases: prefill and decode. The latter contains memory-intensive attention computation, making FPGA-based accelerators an attractive solution for acceleration. However, designing accelerators tailored for the attention module poses a challenge, requiring substantial manual work. To automate this process and achieve superior acceleration performance, we propose AttentionLib, an MLIR-based framework. AttentionLib automatically performs fusion dataflow optimization for attention computations and generates high-level synthesis code in compliance with hardware constraints. Given the large design space, we provide a design space exploration (DSE) engine to automatically identify optimal fusion dataflows within the specified constraints. Experimental results show that AttentionLib is effective in generating well-suited accelerators for diverse attention computations and achieving superior performance under hardware constraints. Notably, the accelerators generated by AttentionLib exhibit at least a 25.1× improvement compared to the baselines solely automatically optimized by Vitis HLS. Furthermore, these designs outperform GPUs in decode workloads, showcasing over a 2× speedup for short sequences.
17:10 CEST TS13.9 ENSURING DATA FRESHNESS FOR IN-STORAGE COMPUTING WITH COOPERATIVE BUFFER MANAGER
Speaker:
Yang Guo, The Chinese University of Hong Kong, HK
Authors:
Jin Xue, Yuhong Song, Yang Guo and Zili Shao, The Chinese University of Hong Kong, HK
Abstract
In-storage computing (ISC) aims to mitigate the excessive data movement between the host memory and storage by offloading computation to storage devices for in-situ execution. However, ensuring data freshness remains a key challenge for practical ISC. For performance considerations, many data processing systems implement a buffer manager to cache part of the on-disk data in the host memory. While the host applications commit updates to the in-memory cached copies of the data, ISC operators offloaded to the device only have access to the on-disk persistent data. Thus, ISC may miss the most recent updates from the host and produce incorrect results after reading the stale and inconsistent data from the persistent storage. With this limitation, current ISC can only be used in read-only settings where the on-disk data are not subject to concurrent updates. To tackle this problem, we propose a cooperative buffer manager for ISC to transparently provide data freshness guarantees to host applications. Proposed methods allow the device to synchronize with the host buffer manager and decide whether to read the most recent copy of data from host memory or flash memory. We implement our method based on a real hardware platform and perform evaluation with a B+-tree based key-value store. Experiments show that our method can provide transparent data freshness for host applications with reduced latency.
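A toy sketch of the freshness decision described above: before an in-storage operator reads a page from flash, it consults a shared view of pages cached dirty in the host buffer pool and fetches those from host memory instead. The structures and names are hypothetical simplifications of the paper's mechanism.

```python
# Cooperative read-path decision for in-storage computing (illustrative sketch).
# Assumption: the device can query which pages are cached dirty on the host and
# pull their latest contents; names and data structures are hypothetical.
class CooperativeBuffer:
    def __init__(self, flash: dict[int, bytes], host_dirty: dict[int, bytes]):
        self.flash = flash            # page_id -> persistent (possibly stale) data
        self.host_dirty = host_dirty  # page_id -> newest copy, dirty in host cache

    def read_fresh(self, page_id: int) -> bytes:
        """Return the freshest copy: a host-cached dirty page wins over flash."""
        if page_id in self.host_dirty:
            return self.host_dirty[page_id]   # synchronize with host buffer pool
        return self.flash[page_id]            # on-disk copy is already fresh

buf = CooperativeBuffer(flash={1: b"old", 2: b"ok"}, host_dirty={1: b"new"})
assert buf.read_fresh(1) == b"new" and buf.read_fresh(2) == b"ok"
```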
17:15 CEST TS13.10 EVALUATING COMPILER-BASED RELIABILITY WITH RADIATION FAULT INJECTION
Speaker:
Davide Baroffio, Politecnico di Milano, IT
Authors:
Davide Baroffio, Tomas López, Federico Reghenzani and William Fornaciari, Politecnico di Milano, IT
Abstract
Compiler-based fault tolerance is a cost-effective and flexible family of solutions that transparently improves software reliability. This paper evaluates a compiler tool for fault detection via laser injection and α-particle exposure. A novel memory allocation strategy is proposed to mitigate the effects of multi-bit upsets. We integrated the detection mechanism with a recovery solution based on mixed-criticality scheduling. The results demonstrate the error detection and recovery capabilities in realistic scenarios: reducing undetected errors, enhancing system reliability, and advancing software-implemented fault tolerance.
17:16 CEST TS13.11 UMBRA: AN EFFICIENT FRAMEWORK FOR TRUSTED EXECUTION ON MODERN TRUSTZONE-ENABLED MICROCONTROLLERS
Speaker:
Stefano Mercogliano, Università di Napoli Federico II, IT
Authors:
Stefano Mercogliano1 and Alessandro Cilardo2
1Università di Napoli Federico II, IT; 2University of Naples, Federico II, IT
Abstract
The rise of microcontrollers in critical systems demands robust security measures beyond traditional methods like Memory Protection Units. ARM's TrustZone-M offers enhanced protection for secure applications, yet its potential for deploying Trusted Execution Environments often remains untapped, leaving room for innovation in managing security on resource-constrained devices. This paper presents Umbra, a Rust-based framework that isolates mutually distrustful applications and integrates with untrusted embedded OSes. Leveraging modern security hardware, Umbra features an efficient secure caching mechanism that encrypts all code exposed to attackers, decrypting and validating only necessary blocks during execution, achieving practical Trusted Execution Environments on modern microcontrollers.
17:17 CEST TS13.12 HARDWARE/SOFTWARE CO-ANALYSIS FOR WORST CASE EXECUTION TIME BOUNDS
Speaker:
Can Joshua Lehmann, Karlsruhe Institute of Technology, DE
Authors:
Can Lehmann1, Lars Bauer2, Hassan Nassar1, Heba Khdr1 and Joerg Henkel1
1Karlsruhe Institute of Technology, DE; 2Independent Scholar, DE
Abstract
Ensuring that safety-critical systems meet timing constraints is crucial to avoid disastrous failures. To verify that timing requirements are met, a worst-case execution time (WCET) bound is computed. However, traditional WCET tools require a predefined timing model for each target processor, which is not available when using custom instruction set extensions. We introduce a novel approach based on hardware-software co-analysis that employs an instrumented hardware description of the target processor, removing the requirement for a separate timing model. We demonstrate this approach by extending the FemtoRV32 Individua RISC-V processor with a custom instruction set extension and show that it accurately models the timing behavior of the resulting system.

TS14 Architectural and microarchitectural design - 2

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS14.1 SPARSYNERGY: UNLOCKING FLEXIBLE AND EFFICIENT DNN ACCELERATION THROUGH MULTI-LEVEL SPARSITY
Speaker:
Jingkui Yang, National University of Defense Technology, CN
Authors:
Jingkui Yang1, Mei Wen1, Junzhong Shen2, Jianchao Yang1, Yasong Cao1, Jun He1, Minjin Tang3, Zhaoyun Chen1 and Yang Shi4
1National University of Defense Technology, CN; 2Key Laboratory of Advanced Microprocessor Chips and Systems, National University of Defense Technology, CN; 3National University of Defense Technology, Key Laboratory of Advanced Microprocessor Chips and Systems, CN; 4National Key Laboratory for Parallel and Distributed Processing and Department of Computer, National University of Defense Technology, CN
Abstract
To more effectively address the computational and memory requirements of deep neural networks (DNNs), leveraging multi-level sparsity (including value-level and bit-level sparsity) has emerged as a pivotal strategy. While substantial research has been dedicated to exploring value-level and bit-level sparsity individually, the combination of both has largely been overlooked until now. In this paper, we propose SparSynergy, which, to the best of our knowledge, is the first accelerator that synergistically integrates multi-level sparsity into a unified framework, maximizing computational efficiency and minimizing memory usage. However, jointly considering multi-level sparsity is non-trivial, as it presents several challenges: (1) increased hardware overhead due to the complexity of incorporating multiple sparsity levels, (2) bandwidth-intensive data transmission during multiplexing, and (3) decreased throughput and scalability caused by bottlenecks in bit-serial computation. Our proposed SparSynergy addresses these challenges by introducing a unified sparsity format and a co-optimized hardware design. Experimental results demonstrate that SparSynergy achieves a 5.38x geometric mean improvement in the energy-delay product (EDP) when compared with the tensor core, across workloads with varying degrees of sparsity. Furthermore, SparSynergy significantly improves accuracy retention compared to state-of-the-art accelerators for representative DNNs.
16:35 CEST TS14.2 PS-GS: GROUP-WISE PARALLEL RENDERING WITH STAGE-WISE COMPLEXITY REDUCTIONS FOR REAL-TIME 3D GAUSSIAN SPLATTING
Speaker:
Joongho Jo, Korea University, KR
Authors:
Joongho Jo and Jongsun Park, Korea University, KR
Abstract
3D Gaussian Splatting (3D-GS) is an emerging rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and image quality. Despite its advantages, running 3D-GS on mobile or edge devices in real time remains challenging due to its large computational complexity. In this paper, we introduce PS-GS, specialized low-complexity hardware designed to enhance the parallelism of the 3D-GS rendering pipeline. In this work, we first observe that 3D-GS rendering can be parallelized when the approximate order of Gaussians, from those closest to the camera to those farthest, is known in advance. However, to enhance 3D-GS rendering speed via parallel processing, an efficient viewpoint-adaptive grouping method with low computational costs is essential. Two key computational bottlenecks of viewpoint-adaptive grouping are the grouping of invisible Gaussians and depth-based sorting. For efficient group-wise parallel rendering with low-complexity viewpoint-adaptive grouping, we propose three key techniques—cluster-based preprocessing, sorting, and grouping—all seamlessly incorporated into the PS-GS architecture. Our experimental results demonstrate that PS-GS delivers an average speedup of 1.20× with negligible peak signal-to-noise ratio (PSNR) degradation.
16:40 CEST TS14.3 TXISC: TRANSACTIONAL FILE PROCESSING IN COMPUTATIONAL SSDS
Speaker:
Penghao Sun, Shanghai Jiao Tong University, CN
Authors:
Penghao Sun1, Shengan Zheng1, Kaijiang Deng1, Guifeng Wang1, Jin Pu1, Jie Yang2, Maojun Yuan2, Feng Zhu2, Shu Li2 and Linpeng Huang1
1Shanghai Jiao Tong University, CN; 2Alibaba Group, CN
Abstract
Computational SSDs implement the in-storage computing (ISC) paradigm and benefit applications by taking over I/O-intensive tasks from the host. Existing works have proposed various frameworks aiming at easy access to ISC functionalities, and among them generic frameworks with file-based abstractions offer better usability. However, since intermediate output by ISC tasks may leave files in a dirty state, concurrent access to and the integrity of file data should be properly managed, which has not been fully addressed. In this paper, we present TxISC, a generic ISC framework that coordinates the host kernel and device firmware to offer a versatile file-based programming model. Under the hood, TxISC turns each invocation of an ISC task into a transaction with full ACID guarantee, fully covering concurrency control and data protection. TxISC implements transactions at low cost by leveraging the out-of-place write characteristic of NAND flash. Evaluation on full-stack hardware shows that transactions incur almost no runtime performance penalty compared with existing ISC architectures. Application case studies demonstrate that the programming model of TxISC can be used to offload complex logic and deliver significant speedup over host-only solutions.
16:45 CEST TS14.4 ARAXL: A PHYSICALLY SCALABLE, ULTRA-WIDE RISC-V VECTOR PROCESSOR DESIGN FOR FAST AND EFFICIENT COMPUTATION ON LONG VECTORS
Speaker:
Navaneeth Kunhi Purayil, ETH Zurich, CH
Authors:
Navaneeth Kunhi Purayil1, Matteo Perotti1, Tim Fischer1 and Luca Benini2
1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT
Abstract
The ever-growing scale of data parallelism in today's HPC and ML applications presents a big challenge for computing architectures' energy efficiency and performance. Vector processors address the scale-up challenge by decoupling Vector Register File (VRF) and datapath widths, allowing the VRF to host long vectors and increase register-stored data reuse while reducing the relative cost of instruction fetch and decode. However, even the largest vector processor designs today struggle to scale to more than 8 vector lanes with double-precision Floating Point Units (FPUs) and 256 64-bit elements per vector register. This limitation is induced by difficulties in the physical implementation, which becomes wire-dominated and inefficient. In this work, we present AraXL, a modular and scalable 64-bit RISC-V V vector architecture targeting long-vector applications for HPC and ML. AraXL addresses the physical scalability challenges of state-of-the-art vector processors with a distributed and hierarchical interconnect, supporting up to 64 parallel vector lanes and reaching the maximum Vector Register File size of 64 Kibit/vreg permitted by the RISC-V V 1.0 ISA specification. Implemented in a 22-nm technology node, our 64-lane AraXL achieves a performance peak of 146 GFLOPs on computation-intensive HPC/ML kernels (>99% FPU utilization) and energy efficiency of 40.1 GFLOPs/W (1.15 GHz, TT, 0.8V), with only 3.8x the area of a 16-lane instance.
16:50 CEST TS14.5 PERFORMANCE IMPLICATIONS OF MULTI-CHIPLET NEURAL PROCESSING UNITS ON AUTONOMOUS DRIVING PERCEPTION
Speaker:
Luke Chen, University of California, Irvine, US
Authors:
Mohanad Odema, Luke Chen, Hyoukjun Kwon and Mohammad Al Faruque, University of California, Irvine, US
Abstract
We study the application of emerging chiplet-based Neural Processing Units to accelerate vehicular AI perception workloads in constrained automotive settings. The motivation stems from how chiplets technology is becoming integral to emerging vehicular architectures, providing a cost-effective trade-off between performance, modularity, and customization; and from perception models being the most computationally demanding workloads in a autonomous driving system. Using the Tesla Autopilot perception pipeline as a case study, we first breakdown its constituent models and profile their performance on different chiplet accelerators. From the insights, we propose a novel scheduling strategy to efficiently deploy perception workloads on multi-chip AI accelerators. Our experiments using a standard DNN performance simulator, MAESTRO, show our approach realizes 82% and 2.8× increase in throughput and processing engines utilization compared to monolithic accelerator designs.
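The scheduling problem sketched in the abstract can be pictured with a generic list scheduler: each sub-model (or layer group) is assigned to the chiplet that becomes free earliest. The function below is a plain illustration under that assumption and is not the paper's scheduling strategy; the stage latencies are hypothetical and inter-chiplet transfer costs are ignored.

    def greedy_chiplet_schedule(stage_latency, num_chiplets):
        """Assign each stage to the chiplet that frees up first (list scheduling).
        Illustrative only; ignores transfer costs and stage dependencies."""
        ready = [0.0] * num_chiplets                    # time at which each chiplet is free
        plan = []
        for lat in stage_latency:
            c = min(range(num_chiplets), key=ready.__getitem__)
            plan.append((c, ready[c], ready[c] + lat))  # (chiplet, start, end)
            ready[c] += lat
        return plan

    # Hypothetical per-stage latencies (ms) mapped onto 4 chiplets.
    print(greedy_chiplet_schedule([3.0, 1.5, 2.0, 4.0, 0.5], num_chiplets=4))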
16:55 CEST TS14.6 LT-OAQ: LEARNABLE THRESHOLD BASED OUTLIER-AWARE QUANTIZATION AND ITS ENERGY-EFFICIENT ACCELERATOR FOR LOW-PRECISION ON-CHIP TRAINING
Speaker:
Qinkai Xu, Nanjing University, CN
Authors:
Qinkai Xu, Yijin Liu, Yuan Meng, Yang Chen, Yunlong Mao, Li Li and Yuxiang Fu, Nanjing University, CN
Abstract
Low-precision training has emerged as a powerful technique for reducing computational and storage costs in Deep Neural Network (DNN) model training, enabling on-chip training or fine-tuning on edge devices. However, existing low-precision training methods often require higher bit-widths to maintain accuracy as model sizes increase. In this paper, we introduce an outlier-aware quantization strategy for low-precision training. While traditional value-aware quantization methods require costly online distribution statistics operations on computational data, impeding the efficiency gains of low-precision training, our approach addresses this challenge through a novel Learnable Threshold based Outlier-Aware Quantization (LT-OAQ) training framework. This method concurrently updates outlier thresholds and model weights through gradient descent, eliminating the need for costly data-statistics operations. To efficiently support the LT-OAQ training framework, we designed a hardware accelerator based on the systolic array architecture. This accelerator introduces a processing element (PE) fusion mechanism that dynamically fuses adjacent PEs into clusters to support outlier computations, optimizing the mapping of outlier computation tasks, enabling mixed-precision training, and implementing online quantization. Our approach maintains model accuracy while significantly reducing computational complexity and storage resource requirements. Experimental results demonstrate that our design achieves a 2.9x speedup in performance and a 2.17x reduction in energy consumption compared to state-of-the-art low-precision accelerators.
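The core outlier-aware idea can be sketched in a few lines of NumPy: values whose magnitude stays below a threshold are quantized to a low bit-width, while the rare outliers are passed through at higher precision. In LT-OAQ the threshold is a learnable parameter updated by gradient descent together with the weights; here it is a fixed argument, so this is only a behavioral sketch with assumed parameters.

    import numpy as np

    def outlier_aware_quantize(x, threshold, inlier_bits=4):
        """Quantize inliers (|x| <= threshold) to inlier_bits; keep outliers in
        higher precision. Threshold is fixed here, learnable in LT-OAQ."""
        qmax = 2 ** (inlier_bits - 1) - 1
        scale = threshold / qmax
        q_inliers = np.clip(np.round(x / scale), -qmax - 1, qmax) * scale
        inlier_mask = np.abs(x) <= threshold
        return np.where(inlier_mask, q_inliers, x), inlier_mask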
17:00 CEST TS14.7 LIGNN: ACCELERATING GNN TRAINING THROUGH LOCALITY-AWARE DROPOUT
Speaker:
Gongjian Sun, SKLP, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences, CN
Authors:
Gongjian Sun1, Mingyu Yan2, Dengke Han3, Runzhen Xue4, Xiaochun Ye1 and Dongrui Fan1
1SKLP, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN; 3SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN; 4State Key Lab of Processors, Institute of Computing Technology, CAS; School of Computer Science and Technology, University of Chinese Academy of Sciences, CN
Abstract
Graph Neural Networks (GNNs) have demonstrated significant success in graph learning and are widely adopted across various critical domains. However, the irregular connectivity between vertices leads to inefficient neighbor aggregation, resulting in substantial irregular and coarse-grained DRAM accesses. This lack of data locality presents significant challenges for execution platforms, ultimately degrading performance. While previous accelerator designs have leveraged on-chip memory and data access scheduling strategies to address this issue, they still inevitably access features at irregular addresses from DRAM. In this work, we propose LiGNN, a hardware-based solution that enhances locality and applies dropout to aggregation to accelerate GNN training. Unlike algorithmic dropout approaches that primarily focus on improving accuracy and neglect hardware costs, LiGNN is specifically designed to drop graph features with data locality awareness, directly targeting the reduction of irregular DRAM accesses while maintaining accuracy. LiGNN introduces locality-aware ordering and a DRAM row integrity policy, enabling configurable burst- and row-granularity dropout at the DRAM level. This approach improves data locality and ensures more efficient DRAM access. Compared to state-of-the-art methods, at a typical dropout rate of 0.5, LiGNN achieves a 1.6~2.2x speedup, reduces DRAM accesses by 44~50% and DRAM row activations by 41~82%, all without losing accuracy.
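One way to picture row-granularity dropout is to map each neighbor feature fetch to a DRAM row and then drop whole rows rather than individual neighbors, so that the surviving accesses stay row-local. The address-to-row mapping and row size below are assumptions for illustration, not LiGNN's actual policy.

    import numpy as np

    def row_granularity_dropout(neighbor_ids, feature_bytes, row_bytes=2048, drop_rate=0.5, seed=0):
        """Drop neighbors row by row: survivors cluster in the DRAM rows that
        were kept, improving access locality (illustrative sketch only)."""
        rng = np.random.default_rng(seed)
        rows = (neighbor_ids * feature_bytes) // row_bytes       # crude address -> row map
        unique_rows = np.unique(rows)
        kept_rows = unique_rows[rng.random(unique_rows.size) >= drop_rate]
        return neighbor_ids[np.isin(rows, kept_rows)]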
17:05 CEST TS14.8 COUPLEDCB: ELIMINATING WASTED PAGES IN COPYBACK-BASED GARBAGE COLLECTION FOR SSDS
Speaker:
Jun Li, Nanjing University of Posts and Telecommunications, CN
Authors:
Jun Li1, Xiaofei Xu2, Zhibing Sha3, Xiaobai Chen1, Jieming Yin1 and Jianwei Liao4
1Nanjing University of Posts and Telecommunications, CN; 2RMIT University, AU; 3Southwest University, CN; 4Southwest University of China, CN
Abstract
The management of garbage collection poses significant challenges in high-density NAND flash-based SSDs. The introduction of the copyback command aims to expedite the migration of valid data. However, its odd/even constraint causes wasted pages during migrations, limiting the efficiency of garbage collection. Additionally, while full-sequence programming enhances write performance in high-density SSDs, it increases write granularity and exacerbates the issue of wasted pages. To address the problem of wasted pages, we propose a novel method called CoupledCB, which utilizes coupled blocks to fill up the wasted space in copyback-based garbage collection. By taking into account the access characteristics of the candidate coupled blocks and workloads, we develop a coupled block selection model assisted by logistic regression. Experimental results show that our proposal significantly enhances garbage collection efficiency and I/O performance compared to state-of-the-art schemes.
17:10 CEST TS14.9 LIGHTMAMBA: EFFICIENT MAMBA ACCELERATION ON FPGA WITH QUANTIZATION AND HARDWARE CO-DESIGN
Speaker:
Renjie Wei, Peking University, CN
Authors:
Renjie Wei, Songqiang Xu, Linfeng Zhong, Zebin Yang, Qingyu Guo, Yuan Wang, Runsheng Wang and Meng Li, Peking University, CN
Abstract
State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves linear computation complexity with the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance the efficiency and hardware costs. Through computation reordering as well as fine-grained tiling and fusion, the hardware utilization and memory efficiency of the accelerator get drastically improved. We implement LightMamba on Xilinx Versal VCK190 FPGA and achieve 4.65∼6.06× higher energy efficiency over the GPU baseline. When evaluated on Alveo U280 FPGA, LightMamba reaches 93 tokens/s, which is 1.43× that of the GPU baseline.
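As a rough illustration of the power-of-two quantization mentioned above (not LightMamba's exact SSM quantizer), the scale factor can be restricted to a power of two so that dequantization reduces to a bit shift in hardware:

    import numpy as np

    def po2_quantize(x, bits=4):
        """Symmetric quantization with a power-of-two scale (dequant = shift).
        Generic sketch, not the paper's rotation-assisted scheme."""
        qmax = 2 ** (bits - 1) - 1
        scale = 2.0 ** np.ceil(np.log2(np.max(np.abs(x)) / qmax + 1e-12))
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale        # reconstruct with q * scale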
17:15 CEST TS14.10 EVALUATING IOMMU-BASED SHARED VIRTUAL ADDRESSING FOR RISC-V EMBEDDED HETEROGENEOUS SOCS
Speaker:
Cyril Koenig, ETH Zurich, CH
Authors:
Cyril Koenig, Enrico Zelioli and Luca Benini, ETH Zurich, CH
Abstract
Embedded heterogeneous Systems-on-Chips (SoCs) rely on domain-specific hardware accelerators to improve performance and energy efficiency. In particular, programmable multicore accelerators feature a cluster of processing elements and tightly coupled scratchpad memories to balance performance, energy efficiency, and flexibility. In embedded systems running a general-purpose OS, accelerators access data via dedicated, physically addressed memory regions. This negatively impacts memory utilization and performance by requiring a copy from the virtual host address to the physical accelerator address space. Input-Output Memory Management Units (IOMMUs) overcome this limitation by allowing devices and hosts to use a shared virtual, paged address space. However, resolving IO virtual addresses can be particularly costly on high-latency memory systems as it requires up to three sequential memory accesses on IOTLB miss. In this work, we present a quantitative evaluation of shared virtual addressing in RISC-V heterogeneous embedded systems. We integrate an IOMMU in an open source heterogeneous RISC-V SoC consisting of a 64-bit host with a 32-bit accelerator cluster. We evaluate the system performance by emulating the design on FPGA and implementing compute kernels from the RajaPERF benchmark suite using heterogeneous OpenMP programming. We measure transfers and computation time on the host and accelerators for systems with different DRAM access latencies. We first show that IO virtual address translation can account for 4.2% up to 17.6% of the accelerator's runtime for GEMM (General Matrix Multiplication) at low and high memory bandwidth. Then, we show that in systems containing a last-level cache, this IO address translation cost falls to 0.4% and 0.7% under the same conditions, making shared-virtual addressing and zero-copy offloading suitable for such RISC-V heterogeneous SoCs.

TS15 Power and Energy Efficient Systems

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS15.1 LESS IS MORE: OPTIMIZING FUNCTION CALLING FOR LLM EXECUTION ON EDGE DEVICES
Speaker:
Iraklis Anagnostopoulos, Southern Illinois University Carbondale, US
Authors:
Varatheepan Paramanayakam1, Andreas Karatzas2, Iraklis Anagnostopoulos2 and Dimitrios Stamoulis3
1Southern Illinois University, US; 2Southern Illinois University Carbondale, US; 3The University of Texas at Austin, US
Abstract
The advanced function-calling capabilities of foundation models open up new possibilities for deploying agents to perform complex API tasks. However, managing large amounts of data and interacting with numerous APIs makes function calling hardware-intensive and costly, especially on edge devices. Current Large Language Models (LLMs) struggle with function calling at the edge because they cannot handle complex inputs or manage multiple tools effectively. This results in low task-completion accuracy, increased delays, and higher power consumption. In this work, we introduce Less-is-More, a novel fine-tuning-free function-calling scheme for dynamic tool selection. Our approach is based on the key insight that selectively reducing the number of tools available to LLMs significantly improves their function-calling performance, execution time, and power efficiency on edge devices. Experimental results with state-of-the-art LLMs on edge hardware show agentic success rate improvements, with execution time reduced by up to 70% and power consumption by up to 40%.
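A minimal sketch of dynamic tool selection under the assumption that tools and the query are represented by embedding vectors: only the k tool descriptions most similar to the query are placed in the function-calling prompt. The names and the similarity metric are placeholders; the paper's selection scheme may differ.

    import numpy as np

    def select_tools(query_vec, tool_vecs, tool_names, k=3):
        """Keep the k tools whose description embeddings are closest to the query
        (cosine similarity); the reduced tool list is then handed to the LLM."""
        sims = tool_vecs @ query_vec / (
            np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
        return [tool_names[i] for i in np.argsort(sims)[::-1][:k]]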
16:35 CEST TS15.2 SSMDVFS: MICROSECOND-SCALE DVFS BASED ON SUPERVISED AND SELF-CALIBRATED ML ON GPGPUS
Speaker:
Minqing Sun, Shanghai Jiao Tong University, CN
Authors:
Minqing Sun1, Ruiqi Sun1, Yingtao Shen1, Wei Yan2, Qinfen Hao2 and An Zou1
1Shanghai Jiao Tong University, CN; 2The Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
Over the past decade, as GPUs have evolved to achieve higher computational performance, their power density has also increased rapidly. Consequently, improving energy efficiency and reducing power consumption have become critically important. Dynamic voltage and frequency scaling (DVFS) is an effective technique for enhancing energy efficiency. With the advent of integrated voltage regulators, DVFS can now operate on microsecond (µs) timescales. However, developing a practical and effective strategy to guide rapid DVFS remains a significant challenge. This paper proposes a supervised and self-calibrated machine learning framework (SSMDVFS) to guide microsecond-scale GPU voltage and frequency scaling. This framework features an end-to-end design that encompasses data generation, neural network model design, training, compression, and final runtime calibration. Unlike analytical models, which struggle to accurately represent GPU architectures, and reinforcement learning approaches, which can be challenging to converge during runtime, SSMDVFS offers a practical solution for guiding microsecond-scale voltage and frequency scaling. Experimental results demonstrate that the proposed framework improves energy-delay product (EDP) by 11.09% and outperforms analytical models and reinforcement learning approaches by 13.17% and 36.80%, respectively.
16:40 CEST TS15.3 A 3D DESIGN METHODOLOGY FOR INTEGRATED WEARABLE SOCS: ENABLING ENERGY EFFICIENCY AND ENHANCED PERFORMANCE AT ISO-AREA FOOTPRINT
Speaker:
Ekin Sumbul, Meta, US
Authors:
H. Ekin Sumbul1, Arne Symons2, Lita Yang2, Huichu Liu2, Tony Wu2, Matheus Trevisan Moreira2, Debabrata Mohapatra2, Abhinav Agarwal2, Kaushik Ravindran2, Chris Thompson2, Yuecheng Li2 and Edith Beigne2
1Meta, US; 2META, US
Abstract
Augmented Reality (AR) System-on-Chips (SoCs) have strict power budgets and form-factor limitations for wearable, all-day use AR glasses running high-performance applications. Limited compute and memory resources that can fit within the strict industrial design area footprint of an AR SoC, however, create performance bottlenecks for demanding workloads such as Pixel Codec Avatars (PiCA) group-calling which connects multiple users with their photorealistic representations. To alleviate this unique wearables challenge, 3D integration with hybrid-bonding technology offers energy-efficient 3D stacking of more silicon resources within the same SoC footprint. Implementing such 3D architectures, however, is another challenge as current EDA tools and flows offer limited 3D design control. In this work, we present a 3D design methodology for robust 3D clock network and datapath design using current EDA tools. To validate the proposed methodology, we implemented a 3D integrated prototype AR SoC housing a 3D-stacked Machine Learning (ML) accelerator utilizing TSMC SoIC™ bonding technology. Silicon measurements demonstrate that the 3D ML accelerator enables running PiCA AR group call at 30 frames-per-second (fps) by 3D-expanding its memory resources by 4× to achieve 2× better energy-efficiency when compared to a 2D baseline accelerator at iso-footprint.
16:45 CEST TS15.4 A LOW-POWER MIXED-PRECISION INTEGRATED MULTIPLY-ACCUMULATE ARCHITECTURE FOR QUANTIZED DEEP NEURAL NETWORKS
Speaker:
Xiaolu Hu, Department of Micro-Nano Electronics, Shanghai Jiao Tong University, CN
Authors:
Xiaolu Hu1, Xinkuang Geng1, Zhigang Mao2, Jie Han3 and Honglan Jiang1
1Shanghai Jiao Tong University, CN; 2Department of Mico-Nano Electronics, CN; 3University of Alberta, CA
Abstract
As mixed-precision quantization techniques have been widely considered for balancing computational efficiency and flexibility in quantized deep neural networks (DNNs), mixed-precision multiply-accumulate (MAC) units are increasingly important in DNN accelerators. However, conventional mixed-precision MAC architectures support either signed×signed or unsigned×unsigned multiplications. Signed×unsigned multiplication, which enhances the computing efficiency of DNNs with ReLU activations, has never been considered in the design of mixed-precision MACs. Thus, this work proposes a mixed-precision MAC architecture supporting six operation modes: int8×int8, int8×uint8, two int4×int4, two int4×uint4, four int2×int2, and four int2×uint2. In this design, to balance the power and delay of the different modes, the multiplication is implemented based on four precision-split 4×4 multipliers (PS4Ms). The accumulation is integrated into the partial product accumulation of the multiplication to eliminate redundant switching activities in separate compression. With a 10% area reduction, the proposed MAC, denoted PS4MAC, reduces power by over 35%, 42%, and 56% for 8-bit, 4-bit, and 2-bit operations, respectively, compared with a design based on the Synopsys DesignWare (DW) multipliers. Additionally, it achieves over 23% power savings for 8-bit operations compared to state-of-the-art (SotA) mixed-precision MAC designs. To save more power, an approximate computing mode for 8-bit multiplication is further designed, resulting in a MAC unit enabling eight operation modes, referred to as PS4MAC_AP. Finally, output-stationary systolic arrays (SAs) are explored using the above-mentioned MAC designs to implement DNNs operating under a 1 GHz clock. Our designs show the highest energy efficiency and outstanding area efficiency in all 8-bit, 4-bit, and 2-bit operation modes. Compared with the traditional SA with high-precision-split multipliers, PS4MAC_AP improves the energy efficiency for 8-bit operations by 0.6 TOPS/W, and PS4MAC achieves a 0.4-0.7 TOPS/W improvement for all operation modes.
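The precision-splitting principle behind PS4Ms can be seen in the unsigned case: an 8x8 product decomposes exactly into four 4x4 partial products that are shifted and added. The snippet below verifies this identity over all 8-bit operand pairs; the signed/unsigned mode handling and the fused accumulation of PS4MAC are considerably more involved than this sketch.

    def mul8_from_4x4(a, b):
        """Compose an unsigned 8x8 multiply from four 4x4 partial products."""
        a_hi, a_lo = a >> 4, a & 0xF
        b_hi, b_lo = b >> 4, b & 0xF
        return (a_lo * b_lo) + ((a_lo * b_hi + a_hi * b_lo) << 4) + ((a_hi * b_hi) << 8)

    # Exhaustive check of the decomposition identity.
    assert all(mul8_from_4x4(a, b) == a * b for a in range(256) for b in range(256))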
16:50 CEST TS15.5 FEDERATED REINFORCEMENT LEARNING FOR OPTIMIZING THE POWER EFFICIENCY OF EDGE DEVICES
Speaker:
Benedikt Dietrich, Karlsruhe Institute of Technology, DE
Authors:
Benedikt Dietrich1, Rasmus Müller-Both2, Heba Khdr3 and Joerg Henkel3
1Chair for Embedded Systems, Karlsruhe Institute of Technology, DE; 2-, DE; 3Karlsruhe Institute of Technology, DE
Abstract
Reinforcement learning (RL) holds great promise for adaptively optimizing microprocessor performance under power constraints. It allows for online learning of application characteristics at runtime and enables adjustment to varying system dynamics such as changes in the workload, user preferences or ambient conditions. However, online policy optimization remains resource-intensive, with high computational demand and requiring many samples to converge, making it challenging to deploy to edge devices. In this work, we overcome both of these obstacles and present federated power control using dynamic voltage and frequency scaling (DVFS). Our technique leverages federated RL and enables multiple independent power controllers running on separate devices to collaboratively train a shared DVFS policy, consolidating experience from a multitude of different applications, while ensuring that no privacy-sensitive information leaves the devices. This leads to faster convergence and to increased robustness of the learned policies. We show that our federated power control achieves 57% average performance improvements over a policy that is only trained on local data. Compared to a state-of-the-art collaborative power control, our technique leads to 22% better performance on average for the running applications under the same power constraint.
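The privacy-preserving aggregation step can be sketched with a FedAvg-style parameter average: each device trains its DVFS policy locally, and only the parameter tensors are combined on a server, so no runtime traces leave the devices. This is a generic sketch rather than the paper's exact protocol, and the parameter names are hypothetical.

    import numpy as np

    def federated_average(local_policies, weights=None):
        """Weighted average of per-device policy parameter dictionaries."""
        weights = weights or [1.0 / len(local_policies)] * len(local_policies)
        return {k: sum(w * p[k] for w, p in zip(weights, local_policies))
                for k in local_policies[0]}

    # Two devices with a hypothetical 2-parameter policy "w".
    global_policy = federated_average([{"w": np.array([1.0, 2.0])},
                                       {"w": np.array([3.0, 4.0])}])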
16:55 CEST TS15.6 AXON: A NOVEL SYSTOLIC ARRAY ARCHITECTURE FOR IMPROVED RUN TIME AND ENERGY EFFICIENT GEMM AND CONV OPERATION WITH ON-CHIP IM2COL
Speaker:
Md Mizanur Rahaman Nayan, Georgia Tech, US
Authors:
Md Mizanur Rahaman Nayan, Ritik Raj, Gouse Shaik Basha, Tushar Krishna and Azad J Naeemi, Georgia Tech, US
Abstract
General matrix multiplication (GeMM) is a core operation in virtually all AI applications. Systolic array (SA) based architectures have shown great promise as GeMM hardware accelerators thanks to their speed and energy efficiency. Unfortunately, SAs incur a linear delay in filling the operands, due to unidirectional propagation via pipeline latches. In this work, we propose a novel in-array data orchestration technique in SAs where we enable data feeding on the principal diagonal followed by bi-directional propagation. This improves the runtime by up to 2× at minimal hardware overhead. In addition, the proposed data orchestration enables convolution lowering (known as im2col) using simple hardware support to fully exploit input feature map reuse opportunities and significantly lower off-chip memory traffic, resulting in a 1.2× throughput improvement and a 2.17× inference energy reduction on YOLOv3 and ResNet50 workloads on average. In contrast, conventional data orchestration would require more elaborate hardware and control signals to implement im2col in hardware because of the data skew. We have synthesized and conducted place and route for 16×16 systolic arrays based on the novel and conventional orchestrations using the ASAP 7nm PDK and found that our proposed approach results in only 0.211% area and 1.6% power overheads.
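For reference, the im2col lowering that AXON supports on-chip is shown below as a plain software transform for a single-channel input: each kxk patch becomes one column, turning the convolution into a matrix multiplication. This is the textbook form of im2col, not the paper's hardware mechanism.

    import numpy as np

    def im2col(x, k):
        """Unfold every kxk patch of a 2-D input into one column."""
        h, w = x.shape
        out_h, out_w = h - k + 1, w - k + 1
        cols = np.empty((k * k, out_h * out_w))
        for i in range(out_h):
            for j in range(out_w):
                cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
        return cols

    x = np.arange(16.0).reshape(4, 4)
    kernel = np.ones((3, 3))
    y = (kernel.ravel() @ im2col(x, 3)).reshape(2, 2)   # equals the valid 3x3 convolution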
17:00 CEST TS15.7 TEMPUS CORE: AREA-POWER EFFICIENT TEMPORAL-UNARY CONVOLUTION CORE FOR LOW-PRECISION EDGE DLAS
Speaker:
Prabhu Vellaisamy, Carnegie Mellon University, US
Authors:
Prabhu Vellaisamy1, Harideep Nair1, Thomas Kang1, Yichen Ni1, Haoyang Fan1, Bin Qi1, Hsien-Fu Hung1, Jeff Chen1, Shawn Blanton1 and John Shen2
1Carnegie Mellon University, US; 2Carnegie Mellon University, US
Abstract
The increasing complexity of deep neural networks (DNNs) poses significant challenges for edge inference deployment due to resource and power constraints of edge devices. Recent works on unary-based matrix multiplication hardware aim to leverage data sparsity and low-precision values to enhance hardware efficiency. However, the adoption and integration of such unary hardware into commercial deep learning accelerators (DLA) remain limited due to processing element (PE) array dataflow differences. This work presents Tempus Core, a convolution core with a highly scalable unary-based PE array comprising tub (temporal-unary-binary) multipliers that seamlessly integrates with the NVDLA (NVIDIA's open-source DLA for accelerating CNNs) while maintaining dataflow compliance and boosting hardware efficiency. Analysis across various datapath granularities shows that for INT8 precision in 45nm CMOS, Tempus Core's PE cell unit (PCU) yields 59.3% and 15.3% reductions in area and power consumption, respectively, over NVDLA's CMAC unit. Considering a 16x16 PE array in Tempus Core, area and power improve by 75% and 62%, respectively, while delivering 5x and 4x iso-area throughput improvements for INT8 and INT4 precisions. Post-place and route analysis of Tempus Core's PCU shows that the 16x4 PE array for INT4 precision in 45nm CMOS requires only 0.017mm^2 die area and consumes only 6.2mW of total power. We demonstrate that area-power efficient unary-based hardware can be seamlessly integrated into conventional DLAs, paving the path for efficient unary hardware for edge AI inference.
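The behavioral idea of a temporal-unary-binary (tub) multiplier can be stated in a few lines: one operand arrives as a train of unary pulses over time, and the binary operand is accumulated once per pulse. This is only the functional principle under that assumption; the actual PE datapath, pulse generation, and timing are far more refined.

    def tub_multiply(unary_count, binary_value):
        """Temporal-unary times binary: accumulate the binary operand once per
        incoming unary pulse (behavioral sketch of the tub principle only)."""
        acc = 0
        for _ in range(unary_count):   # one pulse per time step
            acc += binary_value
        return acc

    assert tub_multiply(13, 7) == 13 * 7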
17:05 CEST TS15.8 ADAPTIVE MULTI-THRESHOLD ENCODING FOR ENERGY-EFFICIENT ECG CLASSIFICATION ARCHITECTURE USING SPIKING NEURAL NETWORK
Speaker:
Mohammad Amin Yaldagard, TU Delft, NL
Authors:
Sumit Diware, Yingzhou Dong, Mohammad Amin Yaldagard and Rajendra Bishnoi, TU Delft, NL
Abstract
Timely identification of cardiac arrhythmia (abnormal heartbeats) is vital for early diagnosis of cardiovascular diseases. Wearable healthcare devices facilitate this process by recording heartbeats through electrocardiogram (ECG) signals and using AI-driven hardware to classify them into arrhythmia classes. Spiking neural networks (SNNs) are well-suited for such hardware as they consume low energy due to event-driven operation. However, their energy-efficiency is constrained by encoding methods that translate real-valued ECG data into spikes. In this paper, we present an SNN-based ECG classification architecture featuring a new adaptive multi-threshold spike encoding scheme. This scheme adjusts encoding window and granularity based on the importance of ECG data samples, to capture essential information with fewer spikes. We develop a high-accuracy SNN model for such spike representation, by proposing a technique specifically tailored to our encoding. We design a hardware architecture for this model, which incorporates optimized layer post-processing for energy-efficient data-flow and employs fixed-point quantization for computational efficiency. Moreover, we integrate this architecture with our encoding scheme into a system-on-chip implementation using TSMC 40nm technology. Results show that our proposed approach achieves better energy-efficiency compared to state-of-the-art, with high ECG classification accuracy.
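One simplistic reading of multi-threshold spike encoding is a thermometer-style code: each sample is compared against a bank of thresholds, and a spike is emitted on every channel whose threshold is exceeded. The paper's scheme additionally adapts the encoding window and granularity to sample importance, which the sketch below does not model; the threshold values are hypothetical.

    import numpy as np

    def multi_threshold_encode(samples, thresholds):
        """Thermometer-style multi-threshold encoding: spike raster of shape
        [num_samples, num_thresholds] (simplified illustration only)."""
        return (samples[:, None] >= thresholds[None, :]).astype(np.uint8)

    raster = multi_threshold_encode(np.array([0.1, 0.45, 0.9]),
                                    thresholds=np.array([0.2, 0.4, 0.6, 0.8]))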
17:10 CEST TS15.9 LOWGRADQ: ADAPTIVE GRADIENT QUANTIZATION FOR LOW-BIT CNN TRAINING VIA KERNEL DENSITY ESTIMATION-GUIDED THRESHOLDING AND HARDWARE-EFFICIENT STOCHASTIC ROUNDING UNIT
Speaker:
Sangbeom Jeong, Seoul National University of Science and Technology, KR
Authors:
Sangbeom Jeong1, Seungil Lee1 and Hyun Kim2
1Seoul National University of Science and Technology, Department of Electrical and Information Engineering, KR; 2Seoul National University of Science and Technology, KR
Abstract
This paper proposes a hardware-efficient INT8 training framework with dual-scale adaptive gradient quantization (DAGQ) to cope with the growing need for efficient on-device CNN training. DAGQ captures both small- and large-magnitude gradients, ensuring robust low-bit training with minimal quantization error. Additionally, to reduce the computational and memory demands of stochastic rounding in low-bit training, we introduce a reusable LFSR-based stochastic rounding unit (RLSRU), which efficiently generates and reuses random numbers, minimizing hardware complexity. The proposed framework achieves stable INT8 training across various networks with minimal accuracy loss while being implementable on RTL-based hardware accelerators, making it well-suited for resource-constrained environments.
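The two ingredients named in the abstract, stochastic rounding and an LFSR as the pseudo-random source, can be sketched as follows. The tap positions and the way random numbers are generated and reused in the actual RLSRU are design choices of the paper; this is only a generic software illustration.

    def lfsr16(state):
        """One step of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11); seed must be nonzero."""
        bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        return (state >> 1) | (bit << 15)

    def stochastic_round(x, state):
        """Round x up with probability equal to its fractional part, using the
        LFSR output as the random source; returns (rounded value, new state)."""
        state = lfsr16(state)
        floor_x = int(x // 1)
        frac = x - floor_x
        return floor_x + (1 if state / 65536.0 < frac else 0), state

    value, state = stochastic_round(2.3, state=0xACE1)   # rounds to 3 with ~30% probability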
17:11 CEST TS15.10 PFASWARE: QUANTIFYING THE ENVIRONMENTAL IMPACT OF PER- AND POLYFLUOROALKYL SUBSTANCES (PFAS) IN COMPUTING SYSTEMS
Speaker:
Mariam Elgamal, Harvard University, US
Authors:
Mariam Elgamal1, Abdulrahman Mahmoud2, Gu-Yeon Wei1, David Brooks1 and Gage Hills1
1Harvard University, US; 2Mohamed bin Zayed University of Artificial Intelligence, AE
Abstract
PFAS (per- and poly-fluoroalkyl substances), also known as forever chemicals, are widely used in electronics and semiconductor manufacturing. PFAS are environmentally persistent and bioaccumulative synthetic chemicals, which have recently received considerable regulatory attention. Manufacturing semiconductors and electronics, including integrated circuits (IC), batteries, displays, etc., currently accounts for a staggering 10% of the total PFAS-containing fluoropolymers used in Europe alone. Now, computer system designers have an opportunity to reduce the use of PFAS in semiconductors and electronics at the design phase. In this work, we quantify the environmental impact of PFAS in computing systems, and outline how designers can optimize their designs to use less PFAS. We show that manufacturing an IC design at a 7 nm technology node using Extreme Ultraviolet (EUV) lithography uses 20% less volume of PFAS-containing chemicals versus manufacturing the same design at a 7 nm node using Deep Ultraviolet (DUV) immersion lithography (instead of EUV). We also show that manufacturing an IC design at a 16 nm technology node results in 15% less volume of PFAS than manufacturing the same design at a 28 nm node due to its smaller area.
17:12 CEST TS15.11 FAST MACHINE LEARNING BASED PREDICTION FOR TEMPERATURE SIMULATION USING COMPACT MODELS
Speaker:
Ayse Coskun, Boston University, US
Authors:
Mohammadamin Hajikhodaverdian1, Sherief Reda2 and Ayse Coskun1
1Boston University, US; 2Brown University, US
Abstract
As transistor densities increase, managing thermal challenges in 3D IC designs becomes more complex. Traditional methods like finite element methods and compact thermal models (CTMs) are computationally expensive, while existing machine learning (ML) models require large datasets and a long training time. To address these challenges with the ML models, we introduce a novel ML framework that integrates with CTMs to accelerate steady-state thermal simulations without needing large datasets. Our approach achieves up to 70× speedup over state-of-the-art simulators, enabling real-time, high-resolution thermal simulations for 2D and 3D IC designs.
17:13 CEST TS15.12 CPP-SGS: CYCLE-ACCURATE POWER PREDICTION FRAMEWORK VIA SNN AND GENETIC SIGNAL SELECTION
Speaker:
Tong Liu, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Tong Liu1, Zijun Jiang2 and Yangdi Lyu1
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2Hong Kong University of Science & Technology (Guangzhou), CN
Abstract
Effective power management is crucial for optimizing the performance and longevity of integrated circuits. Cycle-accurate power prediction can help power management during runtime. This paper introduces a Cycle-accurate Power Prediction framework via Spiking neural networks (SNNs) and Genetic signal Selection (CPP-SGS), which integrates SNNs and Genetic Algorithms (GAs) to predict real-time power consumption of chips. We apply GAs to select the most relevant signals as the input to SNNs to reduce the model size and inference time, making it well-suited for dynamic power estimation in real-time scenarios. The experimental results show that CPP-SGS outperforms the state-of-the-art approaches, with a normalized root mean squared error (NRMSE) of less than 1.6%.

TS16 Design, Test, Modeling and Mitigation of defects and faults

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS16.1 FUSIS: FUSING SURROGATE MODELS AND IMPORTANCE SAMPLING FOR EFFICIENT YIELD ESTIMATION
Speaker:
Wei Xing, The University of Sheffield, GB
Authors:
Yanfang Liu1 and Wei Xing2
1Beihang University, CN; 2The University of Sheffield, GB
Abstract
As process nodes continue to shrink, yield estimation has become increasingly critical in modern circuit design. Traditional approaches face significant challenges: surrogate-based methods often struggle with robustness and accuracy, whereas importance sampling (IS)-based methods suffer from high simulation costs. To address these challenges simultaneously, we propose FUSIS, a unified framework that combines the strengths of surrogate-based and IS-based approaches. Unlike conventional surrogate-based methods that directly replace SPICE simulations for performance predictions, FUSIS employs a Deep Kernel support vector machine (SVM) as an approximation of the indicator function, which is further utilized to construct a quasi-optimal proposal distribution for IS to accelerate convergence. To further mitigate yield estimation bias caused by surrogate inaccuracies, we introduce a novel correction factor to adjust the IS-based yield estimation. Experiments conducted on SRAM and analog circuits demonstrate that FUSIS significantly improves accuracy by up to 24.84% (8.67% on average) while achieving up to 29.54x (10.30x on average) speedup in efficiency compared to seven state-of-the-art methods.
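The importance-sampling estimator at the heart of such methods is P_fail ~ (1/n) sum_i I(fail(x_i)) p(x_i)/q(x_i) with x_i drawn from the proposal q, and yield = 1 - P_fail. In FUSIS a surrogate classifier shapes q toward the failure region and a correction factor compensates for surrogate error; the sketch below shows only the generic estimator with user-supplied placeholder callables.

    import numpy as np

    def normal_pdf(x, mu, sigma=1.0):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    def is_yield_estimate(sample_proposal, proposal_pdf, nominal_pdf, fails, n=100_000, seed=0):
        """Importance-sampling yield estimate: weight each failure indicator by p(x)/q(x)."""
        rng = np.random.default_rng(seed)
        xs = sample_proposal(n, rng)
        w = nominal_pdf(xs) / proposal_pdf(xs)
        return 1.0 - np.mean(fails(xs) * w)

    # Toy 1-D example: nominal N(0,1), failure when x > 3, proposal shifted to N(3,1).
    y = is_yield_estimate(
        sample_proposal=lambda n, rng: rng.normal(3.0, 1.0, n),
        proposal_pdf=lambda x: normal_pdf(x, 3.0),
        nominal_pdf=lambda x: normal_pdf(x, 0.0),
        fails=lambda x: (x > 3.0).astype(float))          # y is close to 0.99865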
16:35 CEST TS16.2 ROTA: ROTATIONAL TORUS ACCELERATOR FOR WEAR LEVELING OF NEURAL PROCESSING ELEMENTS
Speaker:
Taesoo Lim, Yonsei University, KR
Authors:
Taesoo Lim, Hyeonjin Kim, Jingu Park, Bogil Kim and William Song, Yonsei University, KR
Abstract
This paper introduces a reliability-aware neural accelerator design with a wear-leveling solution that balances the utilization of processing elements (PEs). Neural accelerators deploy many PEs to exploit data-level parallelism, but their designs and operations have focused mostly on performance and energy efficiency metrics. Directional dataflows in PE arrays and dimensional misalignment with variable-sized neural layers cause the underutilization of PEs, which is biased to PE locations and gradually accumulated over time. Consequently, the accelerators experience severe usage imbalance between PEs. To resolve the problem, this paper proposes a rotational torus accelerator (RoTA) with an optimized wear-leveling scheme that shuffles PE utilization spaces to eliminate PE usage imbalance. Evaluation results show that RoTA improves lifetime reliability by 1.69x.
16:40 CEST TS16.3 LOCATION IS ALL YOU NEED: EFFICIENT LITHOGRAPHIC HOTSPOT DETECTION USING ONLY POLYGON LOCATIONS
Speaker:
Kang Liu, Huazhong University of Science and Technology, CN
Authors:
Yujia Wang1, Jiaxing Wang1, Dan Feng1, Yuzhe Ma2 and Kang Liu1
1Huazhong University of Science and Technology, CN; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
With integrated circuits at advanced technology nodes shrinking in feature size, lithographic hotspot detection has become increasingly important. Deep learning, especially convolutional neural networks (CNNs) and graph neural networks (GNNs), has recently succeeded in lithographic hotspot detection, where layout patterns, represented as images or graph features, are classified into hotspots and non-hotspots. However, with increasingly sophisticated CNN architectural designs, CNN-based hotspot detection requires excessive training and inference costs with expanding model sizes but only marginally improves detection accuracy. Existing GNN-based hotspot detectors require a more intuitive and efficient layout graph feature representation. Driven by the understanding that lithographic hotspots result from complex interactions among metal polygons through the light system, we propose that the absolute and relative locations of metal polygons are all we need to detect hotspots of a layout clip. We propose a novel layout graph feature representation for hotspot detection where the coordinates of each polygon and the distances between them are taken as node and edge features, respectively. We design an advanced GNN architecture using graph attention and different feature update functions for different edge types of polygons. Our experimental results demonstrate that our architecture achieves the highest hotspot accuracy and the lowest false alarm rate on different datasets. Notably, we employ one-third of the graph features of the previous GNN hotspot detector and achieve higher accuracy. We outperform all CNN hotspot detectors with higher accuracy, up to a 32x speedup in inference time, and a 64x reduction in model size.
16:45 CEST TS16.4 EFFICIENT MODULATED STATE SPACE MODEL FOR MIXED-TYPE WAFER DEFECT PATTERN RECOGNITION
Speaker:
Mu Nie, Anhui Polytechnic University, CN
Authors:
Mu Nie1, ShiDong Zhu1, Aibin Yan2, Zhuo Chen3, Xiaoqing Wen4 and Tianming Ni1
1Anhui Polytechnic University, CN; 2Hefei University of Technology, CN; 3Zhejiang University, CN; 4Kyushu Institute of Technology, JP
Abstract
Accurate and efficient wafer defect detection is crucial in semiconductor manufacturing to maintain product quality and optimize yield. Traditional methods struggle with the complexity and diversity of modern wafer defect patterns. While deep learning approaches are effective, they are often resource-intensive, posing challenges for real-time deployment in industrial settings. To solve these problems, we propose an Efficient Modulated State Space Model (EM-SSM) for mixed-type wafer defect recognition, optimized with knowledge distillation to balance accuracy and efficiency. Our framework captures size-dependent relationships and improves defect-specific feature representation to recognize complex defects precisely. Specifically, we introduce an efficient directional modulation mechanism to refine spatial recognition of defect patterns. To further improve inference efficiency, we propose a deep-to-shallow distillation method that transfers knowledge from deeper networks to lighter networks, reducing inference time without compromising classification accuracy. Experimental results on the MixedWM38 wafer dataset with 38 defect types show that our model achieves 99.0% accuracy, outperforming traditional methods in both accuracy and efficiency. Our model offers a scalable solution for modern semiconductor defect detection.
16:50 CEST TS16.5 MORE-STRESS: MODEL ORDER REDUCTION BASED EFFICIENT NUMERICAL ALGORITHM FOR THERMAL STRESS SIMULATION OF TSV ARRAYS IN 2.5D/3D IC
Speaker:
Tianxiang Zhu, Peking University, CN
Authors:
Tianxiang Zhu, Qipan Wang, Yibo Lin, Runsheng Wang and Ru Huang, Peking University, CN
Abstract
Thermomechanical stress induced by through-silicon vias (TSVs) plays an important role in the performance and reliability analysis of 2.5D/3D ICs. While the finite element method (FEM) adopted by commercial software can provide accurate simulation results, it is very time- and memory-consuming for large-scale analysis. Over the past decade, the linear superposition method has been utilized to perform fast thermal stress estimations of TSV arrays, but it suffers from a lack of accuracy. In this paper, we propose MORE-Stress, a novel strict numerical algorithm for efficient thermal stress simulation of TSV arrays based on model order reduction. Experimental results demonstrate that our algorithm can realize a 153-504x reduction in simulation time and a 39-115x reduction in memory usage compared with the commercial software ANSYS, with negligible errors less than 1%. Our algorithm is as efficient as the linear superposition method, with an order of magnitude smaller errors and fast convergence.
16:55 CEST TS16.6 DYNAMIC IR-DROP PREDICTION THROUGH A MULTI-TASK U-NET WITH PACKAGE EFFECT CONSIDERATION
Speaker:
Yu-Hsuan Chen, National Tsing Hua University, Taiwan, TW
Authors:
Yu-Hsuan Chen1, Yu-Chen Cheng1, Yong-Fong Chang2, Yu-Che Lee1, Jia-Wei Lin2, Hsun-Wei Pao2, Peng-Wen Chen2, Po-Yu Chen2, Hao-Yun Chen2, Yung-Chih Chen3, Chun-Yao Wang1 and Shih-Chieh Chang1
1National Tsing Hua University, TW; 2Mediatek Inc, Taiwan, TW; 3National Taiwan University of Science and Technology, TW
Abstract
Dynamic IR drop analysis is a critical step in the design signoff stage for verifying the power integrity of a chip. Since the analysis is extremely time-consuming, it has led to the emergence of machine learning (ML)-based methods to expedite the procedure. While previous ML approaches have demonstrated the feasibility of IR drop prediction, they often neglect package effects and do not address diverse IR criteria for memory and standard cells. Thus, this paper introduces a novel ML-based approach designed for a fast and accurate prediction of multi-type IR drop, considering package effects. We develop new package-related features to account for the package impact on IR drop. The proposed model is based on a multi-task U-net architecture that not only predicts two types of IR drops simultaneously but also increases prediction accuracy through comprehensive learning. To further enhance the model performance, we introduce the Input Fusion Block (IFB), which unifies units across channels within the input feature maps, leading to improved prediction accuracy. The experimental results show the across-pattern transferability of the proposed IR drop prediction method, demonstrating an RMSE of less than 5mV and an MAE of less than 2mV on the unseen simulation patterns. Additionally, our proposed method achieves a 5X speed-up compared to the commercial tool.
17:00 CEST TS16.7 MINIMUM TIME MAXIMUM FAULT COVERAGE TESTING OF SPIKING NEURAL NETWORKS
Speaker:
Spyridon Raptis, Sorbonne Université, CNRS, LIP6, FR
Authors:
Spyridon Raptis1 and Haralampos-G. Stratigopoulos2
1Sorbonne Université, CNRS, LIP6, FR; 2Sorbonne University, CNRS, LIP6, FR
Abstract
We present a novel test generation algorithm for hardware accelerators of Spiking Neural Networks (SNNs). The algorithm is based on advanced optimization tailored for the spiking domain. It adaptively crafts input samples towards high coverage of hardware-level faults. Time-consuming fault simulation during test generation is circumvented by defining loss functions targeting the maximization of fault sensitisation and fault effect propagation to the output. Comparing the proposed algorithm to the existing ones on three benchmarks, it scales up for large SNN models, and it drastically reduces the test generation runtime from days to hours and the test duration from minutes to seconds. The resultant test input shows near perfect fault coverage and has a duration equivalent to a few dataset samples, thus, besides post-manufacturing testing, it is also suited for in-field testing.
17:05 CEST TS16.8 EGIS: ENTROPY GUIDED IMAGE SYNTHESIS FOR DATASET-AGNOSTIC TESTING OF RRAM-BASED DNNS
Speaker:
Anurup Saha, Georgia Tech, US
Authors:
Anurup Saha, Chandramouli Amarnath, Kwondo Ma and Abhijit Chatterjee, Georgia Tech, US
Abstract
While resistive random access memory (RRAM) based deep neural networks (DNN) are important for low-power inference in IoT and edge applications, they are vulnerable to the effects of manufacturing process variations that degrade their performance (classification accuracy). However, to test the same post-manufacture, the (image) dataset used to train the associated machine learning applications may not be available to the RRAM crossbar manufacturer for privacy reasons. As such, the performance of DNNs needs to be assessed with carefully crafted dataset-agnostic synthetic test images that expose anomalies in the crossbar manufacturing process to the maximum extent possible. In this work, we propose a dataset-agnostic post-manufacture testing framework for RRAM-based DNNs using Entropy Guided Image Synthesis (EGIS). We first create a synthetic image dataset such that the DNN outputs corresponding to the synthetic images minimize an entropy-based loss metric. Next, a small subset (consisting of 10-20 images) of the synthetic image dataset, called the compact image dataset, is created to expedite testing. The response of the device under test (DUT) to the compact image dataset is passed to a machine learning based outlier detector for pass/fail labeling of the DUT. It is seen that the test accuracy using such synthetic test images is very close to that of contemporary test methods.
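The entropy-based loss mentioned above can be written down directly: for a candidate synthetic image, take the DNN's softmax output and compute its Shannon entropy, then optimize the image against this loss. The sketch below shows only the loss term; the synthesis loop, compact-set selection, and the outlier detector are not modeled.

    import numpy as np

    def output_entropy(logits):
        """Shannon entropy of the softmax distribution over class logits."""
        z = logits - np.max(logits)
        p = np.exp(z) / np.sum(np.exp(z))
        return -np.sum(p * np.log(p + 1e-12))

    print(output_entropy(np.array([4.0, 0.1, -2.0])))   # low entropy: a confident output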
17:10 CEST TS16.9 NVSRLO: A FEFET-BASED NON-VOLATILE AND SEU-RECOVERABLE LATCH DESIGN WITH OPTIMIZED OVERHEAD
Speaker:
Wangjin Jiang, Hefei University of Technology, CN
Authors:
Aibin Yan1, Wangjin Jiang1, Han Bao1, Zhengfeng Huang1, Tianming Ni2, Xiaoqing Wen3 and Patrick Girard4
1Hefei University of Technology, CN; 2Anhui Polytechnic University, CN; 3Kyushu Institute of Technology, JP; 4LIRMM, FR
Abstract
This paper presents a FeFET-based non-volatile and single-event upset (SEU) recoverable latch, namely NVSRLO, which does not require any extra control signals. Simulation results show that the proposed latch provides non-volatility and SEU-recovery with optimized overhead. Compared with existing non-volatile latches, NVSRLO significantly reduces delay, power, and delay-power-area product at the cost of area.
17:11 CEST TS16.10 INTERA-ECC: INTERCONNECT-AWARE ERROR CORRECTION IN STT-MRAM
Speaker:
Surendra Hemaram, Karlsruhe Institute of Technology, DE
Authors:
Surendra Hemaram1, Mahta Mayahinia1, Mehdi Tahoori1, Francky Catthoor2, Siddharth Rao2, Sebastien Couet2, Tommaso Marinelli3, Anita Farokhnejad2 and Gouri Kar2
1Karlsruhe Institute of Technology, DE; 2IMEC, BE; 3imec, BE
Abstract
Spin-transfer torque magnetic random access memory (STT-MRAM) is a promising alternative to existing memory technologies. However, STT-MRAM faces reliability challenges, primarily due to stochastic switching, process variation, and manufacturing defects. These reliability challenges become even worse due to interconnect parasitic resistive-capacitive effects, potentially compromising the reliability of memory cells located far from the write driver. This can severely impair the manufacturing yield and large-scale industrial adoption. To address this, we propose an interconnect-aware error correction coding (InterA-ECC), which provides non-uniform error correction to a different zone of the memory subarray. The proposed InterA-ECC strategy selectively applies robust error-correction code (ECC) to specific rows within the subarray rather than uniformly across all rows, reducing ECC parity bits while enhancing bit error rate resiliency in the most vulnerable memory zone.
17:12 CEST TS16.11 ASSESSING SOFT ERROR RELIABILITY IN VECTORIZED KERNELS: VULNERABILITY AND PERFORMANCE TRADE-OFFS ON ARM AND RISC-V ISAS
Speaker and Author:
Geancarlo Abich, UFRGS, BR
Abstract
The demand for advanced processing capabilities is paramount in the ever-evolving landscape of radiation-resilient computing exploration. With the standardization of vector extensions on Arm and RISC-V ISAs, leading technology companies are adopting high-performance processors to exploit vector capabilities. In this regard, this work proposes an automated register cross-section reliability evaluation, extending uniform random register-file fault injection to assess the increased vulnerability that comes with the vector register length. The technique enables soft error reliability assessment of the RISC-V and Arm vector extensions and comparison with their scalar counterparts over different integer and FP precisions. The obtained results show that soft error criticality correlates with the registers' cross-section, and the vectorized benchmarks exhibited error susceptibility of up to 78%, compared with 6% for the scalar versions, varying with precision. This emphasizes the necessity of balancing performance and reliability in emerging onboard platforms with vector capabilities.
17:13 CEST TS16.12 EARLY FUNCTIONAL SAFETY AND PPA EVALUATION OF DIGITAL DESIGNS
Speaker:
Michelangelo Bartolomucci, Politecnico di Torino, IT
Authors:
Michelangelo Bartolomucci1, David Kingston2, Teo Cupaiuolo3, Alessandra Nardi4 and Riccardo Cantoro1
1Politecnico di Torino, IT; 2Synopsys, GB; 3Synopsys, IT; 4Synopsys, US
Abstract
The use of semiconductor devices in safety-critical scenarios is increasing in both quantity and complexity. This paper presents a novel approach to support safety requirements from RTL exploration through to implementation, with the aid of a Safety Specification Format (SSF), thereby minimizing costly development iterations and reducing the Time-To-Market. An assessment of the results is given for the CV32E40P open source RISC-V processor.

DP DATE Party

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 19:30 CEST - 23:00 CEST


Wednesday, 02 April 2025

ES Executive Session

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST


FS07 Focus Session - European Startups on AI: Path to Success

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST

Session chair:
Anton Klotz, Fraunhofer, DE

Session co-chair:
Marco Inglardi, Synopsys, IT

AI is one of the hottest and most influential topics of this decade. Specialized hardware is required to run AI algorithms, and several companies, among them a number of European startups, are working on designing such hardware. These startups must overcome multiple challenges: a lack of financing, a shortage of skilled workforce, and the difficulty of finding customers willing to take the risk of working with a startup rather than an established market leader. In this session, several startups take the floor and explain how they have managed to overcome these challenges. We also hear the perspectives of a commercial startup incubator specialized in microelectronics startups and of an academic who has spun off several startups. After the impulse presentations, there will be a panel discussion in which the panelists answer questions from the audience on the landscape of microelectronics startups in Europe.

Participants:
Manu Nair, Synthara, CH
Patrick Couvert, NEUrXCORE, FR
Sean Redmond, Silicon Catalyst, UK
David Atienza, EPFL, CH
Edith Euan Diaz, Axelera, NL


LBR01 Late Breaking Results

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST


SD03 Special Day on Emerging Computing Paradigms

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST

Session chair:
John Paul Strachan, Forschungszentrum Juelich GmbH, DE

Time Label Presentation Title
Authors
08:30 CEST SD03.1 OPENING AND INTRODUCTION TO THE SPECIAL DAY
Presenter:
John Paul Strachan, Forschungszentrum Juelich GmbH, DE
Author:
John Paul Strachan, Forschungszentrum Juelich GmbH, DE
Abstract
.
08:53 CEST SD03.2 ERROR PROPAGATION THROUGH SPACE, TIME AND THE BRAIN
Presenter:
Mihai Petrovici, University of Bern, CH
Author:
Mihai Petrovici, University of Bern, CH
Abstract
.
09:15 CEST SD03.3 SPINTRONIC NEURAL NETWORKS
Presenter:
Julie Grollier, CNRS/Thales, FR
Author:
Julie Grollier, CNRS/Thales, FR
Abstract
.
09:38 CEST SD03.4 SILICON PHOTONICS FOR AI - THE GOOD, THE BAD AND THE UGLY
Presenter:
Thomas Van Vaerenbergh, Hewlett Packard Labs, BE
Author:
Thomas Van Vaerenbergh, Hewlett Packard Labs, BE
Abstract
.

SoCL SoC Labs: The academic community for System on Chip Development

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 12:30 CEST

8:30-8:40 Welcome
8:40-10:20 Presentation of selected teams of the "Understanding our world" research and education design contest
10:20-12:00 Education: How to build SoC project throughout all design phases until tape out
12:00-12:30 Presentation of the 2026 SoC Labs contest


TS17 System simulation and validation

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST TS17.1 FLOPPYFLOAT: AN OPEN SOURCE FLOATING POINT LIBRARY FOR INSTRUCTION SET SIMULATORS
Speaker:
Niko Zurstraßen, RWTH Aachen University, DE
Authors:
Niko Zurstraßen, Nils Bosbach and Rainer Leupers, RWTH Aachen University, DE
Abstract
Instruction Set Simulators (ISSs) are important software tools that facilitate the simulation of arbitrary compute systems. One of the most challenging aspects of ISS development is the modeling of Floating Point (FP) arithmetic. Despite an industry standard specifically created to avoid fragmentation, every Instruction Set Architecture (ISA) comes with an individual definition of FP arithmetic. Hence, many simulators, such as gem5 or Spike, do not use the Floating Point Unit (FPU) of the host system, but resort to soft float libraries. These libraries offer great flexibility and portability by calculating FP instructions by means of integer arithmetic. However, using tens or hundreds of integer instructions to model a single FP instruction is detrimental to the simulator's performance. Tackling the poor performance of soft float libraries, we present FloppyFloat - an open-source FP library for ISSs. FloppyFloat leverages the host FPU for basic calculations and rectifies corner cases in software. In comparison to the popular Berkeley SoftFloat, FloppyFloat achieves speedups of up to 5.5x for individual instructions. As a replacement for SoftFloat in the RISC-V golden reference simulator Spike, FloppyFloat accelerates common FP benchmarks by up to 1.41x.
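A toy example of the "host FPU plus software fix-ups" approach, using one well-known corner case: RISC-V requires arithmetic to return the canonical quiet NaN, whereas host FPUs such as x86 propagate input NaN payloads. The function name is hypothetical and not part of the FloppyFloat API; the real library rectifies many more cases (exception flags, rounding modes, etc.).

    import math
    import struct

    RISCV_CANONICAL_QNAN = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]

    def riscv_fadd_d(a, b):
        """Double-precision add on the host FPU, then canonicalize NaN results
        in software as the RISC-V ISA requires (illustrative sketch only)."""
        r = a + b
        return RISCV_CANONICAL_QNAN if math.isnan(r) else r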
08:35 CEST TS17.2 HANDLING LATCH LOOPS IN TIMING ANALYSIS WITH IMPROVED COMPLEXITY AND DIVERGENT LOOP DETECTION
Speaker:
Xizhe Shi, Peking University, CN
Authors:
Xizhe Shi, Zizheng Guo, Yibo Lin, Runsheng Wang and Ru Huang, Peking University, CN
Abstract
Latch loops introduce feedback cycles in timing graphs for static timing analysis (STA), disrupting timing propagation in topological order. Existing timers handle latch loops by checking the convergence of global iterations in timing propagation without lookahead detection of divergent loops. Such a strategy ends up with the worst-case runtime complexity O(n²), where n is the number of pins in the timing graph. This can be extremely time-consuming, when n goes to millions and beyond. In this paper, we address this challenge by proposing a new algorithm consisting of two steps. First, we identify the strongly connected components (SCCs) and levelize them into different stages. Second, we implement parallelized arrival time (AT) propagation between SCCs while conducting sequential iterations inside each SCC. This strategy significantly reduces the runtime complexity to O(∑(k_i)²) from the previous global propagation, where k_i is the number of pins in each SCC. Our timer also detects timing information divergent loops in advance, avoiding over-iteration. Experimental results on industrial designs demonstrate 10.31× and 8.77× speed-up over PrimeTime and OpenSTA on average, respectively.
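The two-step structure described above (collapse latch loops into strongly connected components, then propagate between SCCs in topological order while iterating only inside each SCC) can be sketched with networkx. The edge list below is a hypothetical timing graph, and the arrival-time iteration inside each SCC is omitted.

    import networkx as nx

    def levelize_timing_graph(edges):
        """Condense SCCs (latch loops) into single nodes and levelize the
        resulting DAG so SCCs can be processed in topological order."""
        g = nx.DiGraph(edges)
        cond = nx.condensation(g)             # DAG whose nodes are SCCs
        level = {}
        for scc in nx.topological_sort(cond):
            preds = list(cond.predecessors(scc))
            level[scc] = 1 + max((level[p] for p in preds), default=0)
        return cond, level                    # cond.nodes[scc]["members"] lists the pins

    cond, lvl = levelize_timing_graph([("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")])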
08:40 CEST TS17.3 STATIC GLOBAL REGISTER ALLOCATION FOR DYNAMIC BINARY TRANSLATORS
Speaker:
Niko Zurstraßen, RWTH Aachen University, DE
Authors:
Niko Zurstraßen, Nils Bosbach, Lennart Reimann and Rainer Leupers, RWTH Aachen University, DE
Abstract
Dynamic Binary Translators (DBTs) facilitate the execution of binaries across different Instruction Set Architectures (ISAs). Similar to a just-in-time compiler, they recompile machine code from one ISA to another, and subsequently execute the generated code. To achieve near-native execution speed, several challenges must be overcome. This includes the problem of register allocation (RA). In classical compiler engineering, RA is often performed by global methods. However, due to the nature of DBTs, established global methods like graph coloring or linear scan are hardly applicable. This is why state-of-the-art DBTs, like QEMU, use basic-block-local methods, which come with several disadvantages. Addressing these flaws, we propose a novel global method based on static target-to-host mappings. As most applications only work on a small set of registers, mapping them statically from target to host significantly reduces load/store overhead. In a case study using our RISC-V-on-ARM64 user-mode simulator RISE SIM, we demonstrate speedups of up to 1.4× compared to basic-block-local methods.
08:45 CEST TS17.4 CORRECTBENCH: AUTOMATIC TESTBENCH GENERATION WITH FUNCTIONAL SELF-CORRECTION USING LLMS FOR HDL DESIGN
Speaker:
Ruidi Qiu, TU Munich, DE
Authors:
Ruidi Qiu1, Grace Li Zhang2, Rolf Drechsler3, Ulf Schlichtmann1 and Bing Li4
1TU Munich, DE; 2TU Darmstadt, DE; 3University of Bremen | DFKI, DE; 4University of Siegen, DE
Abstract
Functional simulation is an essential step in digital hardware design. Recently, there has been a growing interest in leveraging Large Language Models (LLMs) for hardware testbench generation tasks. However, the inherent instability associated with LLMs often leads to functional errors in the generated testbenches. Previous methods do not incorporate automatic functional correction mechanisms without human intervention and still suffer from low success rates, especially for sequential tasks. To address this issue, we propose CorrectBench, an automatic testbench generation framework with functional self-validation and self-correction. Utilizing only the RTL specification in natural language, the proposed approach can validate the correctness of the generated testbenches with a success rate of 88.85%. Furthermore, the proposed LLM-based corrector employs bug information obtained during the self-validation process to perform functional self-correction on the generated testbenches. The comparative analysis demonstrates that our method achieves a pass ratio of 70.13% across all evaluated tasks, compared with the previous LLM-based testbench generation framework's 52.18% and a direct LLM-based generation method's 33.33%. Specifically, in sequential circuits, our work's performance is 62.18% higher than previous work, and almost 5 times the pass ratio of the direct method. The codes and experimental results are open-sourced at the link: https://anonymous.4open.science/r/CorrectBench-8CEA.
08:50 CEST TS17.5 CISGRAPH: A CONTRIBUTION-DRIVEN ACCELERATOR FOR PAIRWISE STREAMING GRAPH ANALYTICS
Speaker:
Songyu Feng, Institute of Computing Technology, Chinese Academy of Sciences, CN
Authors:
Songyu Feng1, Mo Zou2 and Tian Zhi2
1Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
Recent research observed that pairwise query is practical enough in real-world streaming graph analytics. Given a pair of distinct vertices, existing approaches coalesce or prune vertex activations to decrease computations. However, they still suffer from severe invalid computations because they ignore contribution variations in graph updates, hindering performance improvement. In this work, we propose to enhance pairwise analytics by taking update contributions into account. We first identify that graph updates from one batch have distinct impacts on query results and incur markedly different computation overheads. We then introduce CISGraph, a novel Contribution-driven pairwise accelerator with valuable updates Identification and Scheduling. Specifically, inspired by the triangle inequality, CISGraph categorizes graph updates into three levels according to their contributions, prioritizes valuable updates, delays possibly valuable updates, and drops useless updates to eliminate wasteful computations. As far as we know, CISGraph is the first hardware accelerator that supports efficient pairwise queries on streaming graphs. Experimental results show that CISGraph substantially outperforms state-of-the-art streaming graph processing systems by 25× on average in response time.
08:55 CEST TS17.6 HIGH-PERFORMANCE ARM-ON-ARM VIRTUALIZATION FOR MULTICORE SYSTEMC-TLM-BASED VIRTUAL PLATFORMS
Speaker:
Nils Bosbach, RWTH Aachen University, DE
Authors:
Nils Bosbach1, Rebecca Pelke1, Niko Zurstraßen1, Jan Weinstock2, Lukas Jünger2 and Rainer Leupers1
1RWTH Aachen University, DE; 2MachineWare GmbH, DE
Abstract
The increasing complexity of hardware and software requires advanced development and test methodologies for modern systems on chips. This paper presents a novel approach to ARM-on-ARM virtualization within SystemC-based simulators using Linux's KVM to achieve high-performance simulation. By running target software natively on ARM-based hosts with hardware-based virtualization extensions, our method eliminates the need for instruction-set simulators, which significantly improves performance. We present a multicore SystemC-TLM-based CPU model that can be used as a drop-in replacement for an instruction-set simulator. It places no special requirements on the host system, making it compatible with various environments. Benchmark results show that our ARM-on-ARM-based virtual platform achieves up to 10× speedup over traditional instruction-set-simulator-based models on compute-intensive workloads. Depending on the benchmark, speedups increase to more than 100×.
09:00 CEST TS17.7 RTHETER: SIMULATING REAL-TIME SCHEDULING OF MULTIPLE TASKS IN HETEROGENEOUS ARCHITECTURES
Speaker:
Yinchen Ni, Shanghai Jiao Tong University, CN
Authors:
Yinchen Ni1, Jiace Zhu1, Yier Jin2 and An Zou1
1Shanghai Jiao Tong University, CN; 2University of Science and Technology of China, CN
Abstract
The rising popularity of AI applications is driving the adoption of heterogeneous computing architectures to handle complex computations. However, as these heterogeneous architectures grow more complex, optimizing the scheduling of multiple tasks and meeting strict timing constraints becomes significantly challenging. Current studies on real-time scheduling on heterogeneous processors lack agile and flexible simulation tools that can quickly adapt to varying system settings, leading to inefficiencies in system design. Additionally, the high costs associated with evaluating real-time performance in terms of human and facility effort further complicate the development process. To address these challenges, this paper introduces a comprehensive hierarchical simulation approach and a corresponding simulator designed for flexible heterogeneous computing platforms. The simulator supports ideal or practical, off-the-shelf or customizable heterogeneous architectures, upon which it can execute both parallel and dependent tasks. Utilizing this simulator, we present two case studies that were previously time-consuming but are now easily carried out. The first case study reveals the possibility of using policy-based reinforcement learning to explore novel scheduling strategies; the second explores the dominant processors within heterogeneous architectures, providing insights for optimizing heterogeneous architecture design.
09:05 CEST TS17.8 FAST INTERPRETER-BASED INSTRUCTION SET SIMULATION FOR VIRTUAL PROTOTYPES
Speaker:
Manfred Schlägl, Institute for Complex Systems, Johannes Kepler University Linz, AT
Authors:
Manfred Schlaegl and Daniel Grosse, Johannes Kepler University Linz, AT
Abstract
The Instruction Set Simulators (ISSs) used in Virtual Prototypes (VPs) are typically implemented as interpreters with the goal of being easy to understand and fast to adapt and extend. However, the performance of instruction interpretation is very limited, and the ever-increasing complexity of Hardware (HW) poses a growing challenge to this approach. In this paper, we present optimization techniques for interpreter-based ISSs that significantly boost performance while preserving comprehensibility and adaptability. We consider the RISC-V ISS of an existing, SystemC-based open-source VP with extensive capabilities such as running Linux and interactive graphical applications. The optimization techniques feature a Dynamic Basic Block Cache (DBBCache) to accelerate ISS instruction processing and a Load/Store Cache (LSCache) to speed up ISS load and store operations to and from memory. In our evaluation, we consider 12 Linux-based benchmark workloads and compare our optimizations to the original VP as well as to the very efficient official RISC-V reference simulator Spike maintained by RISC-V International. Overall, we achieve up to 406.97 Million Instructions per Second (MIPS) and a significant average performance increase, by a factor of 8.98 over the original VP and 1.65 over the Spike simulator. To showcase the retention of both comprehensibility and adaptability, we implement support for the RISC-V half-precision floating-point extension (Zfh) in both the original and the optimized VP. A comparison of these implementations reveals no significant differences, ensuring that the stated qualities remain unaffected. The optimized VP including Zfh is available as open-source on GitHub.
09:10 CEST TS17.9 C2C-GEM5: FULL SYSTEM SIMULATION OF CACHE-COHERENT CHIP-TO-CHIP INTERCONNECTS
Speaker:
Luis Bertran Alvarez, LIRMM, FR
Authors:
Luis Bertran Alvarez1, Ghassan Chehaibar2, Stephen Busch2, Pascal Benoit3 and David Novo3
1LIRMM / Eviden, FR; 2Eviden, FR; 3Université de Montpellier, FR
Abstract
High-Performance Computing (HPC) is shifting toward chiplet-based System-on-Chip (SoC) architectures, necessitating advanced simulation tools for design and optimization. In this work, we extend the gem5 simulator to support cache-coherent multi-chip systems by introducing a new chip-to-chip interconnect model within the Ruby framework. Our implementation is adaptable to various coherence protocols, such as Arm CHI. Calibrated with real hardware, our model is evaluated using PARSEC workloads, demonstrating its accuracy in simulating coherent chip-to-chip interactions and its effectiveness in capturing key performance metrics early in the design flow.
09:15 CEST TS17.10 A 101 TOPS/W AND 1.73 TOPS/MM² 6T SRAM-BASED DIGITAL COMPUTE-IN-MEMORY MACRO FEATURING A NOVEL 2T MULTIPLIER
Speaker:
Priyanshu Tyagi, IIT Roorkee, IN
Authors:
Priyanshu Tyagi and Sparsh Mittal, IIT Roorkee, IN
Abstract
In this paper, we propose a 6T SRAM-based all-digital Compute-in-memory (CIM) macro for multi-bit multiply-and-accumulate (MAC) operations. We propose a novel 2T bitwise multiplier, which is a direct improvement over the previously proposed 4T NOR-gate-based multiplier. The 2T multiplier also eliminates the need to invert the input bits, which is required when using NOR gates as multipliers. We propose an efficient digital MAC computation flow based on a barrel shifter, which significantly reduces the latency of the shift operation. This brings down the overall latency incurred while performing MAC operations to 13ns/25ns for 4b/8b operands (in 65nm CMOS @ 0.6V), compared to 10ns/18ns (in 22nm CMOS @ 0.72V) for the previous work. The proposed CIM macro is fully re-configurable in weight bits (4/8/12/16) and input bits (4/8). It can perform concurrent MAC and weight update operations. Moreover, its fully digital implementation circumvents the challenges associated with analog CIM macros. For MAC operation with 4b weight and input, the macro achieves 24 TOPS/W at 1.2 V and 81 TOPS/W at 0.7 V. When using low-threshold-voltage transistors in the 2T multiplier, the macro works reliably even at 0.6V while achieving 101 TOPS/W.
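Functionally, a bitwise-multiplier-plus-barrel-shifter MAC flow amounts to accumulating 1-bit partial products shifted into position, as in this illustrative behavioral sketch (not the macro's circuit; operand widths are assumptions).

```python
# Illustrative bit-serial MAC: each multi-bit product is built from 1-bit AND
# partial products, shifted into place and accumulated; in hardware the AND is
# the bitwise multiplier and the shift is handled by a barrel shifter.
def mac_bitwise(weights, inputs, wbits=4, ibits=4):
    acc = 0
    for w, x in zip(weights, inputs):
        for i in range(ibits):               # one input bit per step
            x_bit = (x >> i) & 1
            for j in range(wbits):           # 1-bit "multiplier": AND of two bits
                w_bit = (w >> j) & 1
                acc += (w_bit & x_bit) << (i + j)   # shift partial product into position
    return acc

ws, xs = [3, 5, 7], [2, 4, 6]
assert mac_bitwise(ws, xs) == sum(w * x for w, x in zip(ws, xs))
print(mac_bitwise(ws, xs))   # 3*2 + 5*4 + 7*6 = 68
```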

TS18 Machine learning solutions for embedded and cyber-physical systems

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST TS18.1 DE²R: UNIFYING DVFS AND EARLY-EXIT FOR EMBEDDED AI INFERENCE VIA REINFORCEMENT LEARNING
Speaker:
Yuting He, University of Nottingham Ningbo China, CN
Authors:
Yuting He1, Jingjin Li1, Chengtai Li1, Qingyu Yang1, Zheng Wang2, Heshan Du1, Jianfeng Ren1 and Heng Yu1
1University of Nottingham Ningbo China, CN; 2Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, CN
Abstract
Executing neural networks on resource-constrained embedded devices is challenging, and efforts have been made at both the application and system levels to reduce the execution cost. Among them, early-exit networks reduce computational cost through intermediate exits, while Dynamic Voltage and Frequency Scaling (DVFS) offers system-level energy reduction. Existing works strive to unify early-exit and DVFS for combined benefits in both timing and energy flexibility, yet two limitations remain: 1) varying time constraints, which change how important each exit point is for inference accuracy, are not accounted for, and 2) the large configuration space prevents optimal decisions when unifying DVFS and early-exit as a multi-objective optimization problem. To address these challenges, we propose De²r, a reinforcement learning-based framework that jointly optimizes early-exit points and DVFS settings for continuous inference. In particular, De²r includes a cross-training mechanism that fine-tunes the early-exit network to accommodate dynamic time constraints and system conditions. Experimental results demonstrate that De²r achieves up to 22.03% energy reduction and 3.23% accuracy gain compared to contemporary techniques.
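A minimal sketch of the joint decision such a framework makes, assuming a tabular Q-learning agent over a discrete (exit point, DVFS level) action space; the states, reward, and hyperparameters are illustrative and not De²r's actual design.

```python
# Minimal sketch: a tabular Q-learning agent whose action jointly picks an
# early-exit point and a DVFS level. States, rewards and the action space
# are illustrative assumptions.
import random
from itertools import product

EXITS = [1, 2, 3]                    # candidate exit points in the network
DVFS = ["low", "mid", "high"]        # frequency/voltage levels
ACTIONS = list(product(EXITS, DVFS)) # joint action space
Q = {}                               # (state, action) -> value

def choose_action(state, eps=0.1):
    if random.random() < eps:
        return random.choice(ACTIONS)              # explore
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))   # exploit

def update(state, action, reward, next_state, alpha=0.5, gamma=0.9):
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

# One illustrative step: tight-deadline state, toy reward trading accuracy vs. energy.
s = ("deadline_tight",)
a = choose_action(s)
r = 1.0 if a[0] <= 2 and a[1] != "high" else -1.0
update(s, a, r, s)
print(a, Q[(s, a)])
```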
08:35 CEST TS18.2 CONTINUOUS GNN-BASED ANOMALY DETECTION ON EDGE USING EFFICIENT ADAPTIVE KNOWLEDGE GRAPH LEARNING
Speaker:
Sanggeon Yun, University of California, Irvine, US
Authors:
Sanggeon Yun1, Ryozo Masukawa1, William Chung1, Minhyoung Na2, Nathaniel Bastian3 and Mohsen Imani1
1University of California, Irvine, US; 2Kookmin University, KR; 3United States Military Academy at West Point, US
Abstract
The increasing demand for robust security solutions across various industries has made Video Anomaly Detection (VAD) a critical task in applications such as intelligent surveillance, evidence investigation, and violence detection. Traditional approaches to VAD often rely on finetuning large pre-trained models, which can be computationally expensive and impractical for real-time or resource-constrained environments. To address this, MissionGNN introduced a more efficient method by training a graph neural network (GNN) using a fixed knowledge graph (KG) derived from large language models (LLMs) like GPT-4. While this approach demonstrated significant efficiency in computational power and memory, it faces limitations in dynamic environments where frequent updates to the KG are necessary due to evolving behavior trends and shifting data patterns. These updates typically require cloud-based computation, posing challenges for edge computing applications. In this paper, we propose a novel framework that facilitates continuous KG adaptation directly on edge devices, overcoming the limitations of cloud dependency. Our method dynamically modifies the KG through a three-phase process: pruning, alternating, and creating nodes, enabling real-time adaptation to changing data trends. This continuous learning approach enhances the robustness of anomaly detection models, making them more suitable for deployment in dynamic and resource-constrained environments.
08:40 CEST TS18.3 BMP-SD: MARRYING BINARY AND MIXED-PRECISION QUANTIZATION FOR EFFICIENT STABLE DIFFUSION INFERENCE
Speaker:
Cheng Gu, Shanghai Jiao Tong University, CN
Authors:
Cheng Gu1, Gang Li2, Xiaolong Lin1, Jiayao Ling1, Jian Cheng3 and Xiaoyao Liang1
1Shanghai Jiao Tong University, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN; 3Institute of Automation, CN
Abstract
Stable Diffusion (SD) is an emerging deep neural network (DNN) model that has demonstrated impressive capabilities in generative tasks such as text-to-image generation. However, the iterative denoising stage with UNet in the SD model is extremely expensive in both computations and memory accesses, making it challenging for fast and energy-efficient edge deployment. To alleviate the overhead of denoising, in this paper we propose BMP-SD, a post-training quantization framework for hardware-efficient SD inference. BMP-SD employs binary weight quantization to significantly reduce the computational complexity and memory footprint of iterative denoising, along with dynamic, step-aware mixed-precision activation quantization, based on the observation that not all denoising steps are equally important. Experiments on the text-to-image generation task show that BMP-SD achieves mixed precision (W1.73A4.87) with minimal accuracy loss on MS-COCO 2014. We also evaluate the BMP-SD quantized model on multiple bit-flexible DNN accelerators; the results reveal that our method can deliver up to 5.14× performance and 3.85× energy efficiency improvements compared to W8A8 quantization.
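The combination of binary weights with step-aware activation bit-widths can be mimicked in a few lines of numpy, as sketched below; the scaling scheme and the bit-width schedule are illustrative assumptions, not BMP-SD's calibration.

```python
# Illustrative numpy sketch: binary weight quantization with a per-tensor scale,
# plus a per-denoising-step activation bit-width, mirroring the idea of
# "binary weights + step-aware mixed-precision activations".
import numpy as np

def binarize_weights(w):
    scale = np.mean(np.abs(w))            # one scale factor for the tensor
    return np.sign(w) * scale, scale      # weights become {-scale, +scale}

def quantize_activations(x, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax if np.max(np.abs(x)) > 0 else 1.0
    return np.round(x / scale) * scale

def act_bits_for_step(step, total_steps):
    # Hypothetical schedule: early denoising steps tolerate fewer activation bits.
    return 4 if step < total_steps // 2 else 8

w = np.random.randn(64, 64).astype(np.float32)
x = np.random.randn(64).astype(np.float32)
wb, _ = binarize_weights(w)
for step in (0, 40):
    xq = quantize_activations(x, act_bits_for_step(step, 50))
    print(step, act_bits_for_step(step, 50), float(np.abs(wb @ xq - w @ x).mean()))
```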
08:45 CEST TS18.4 DISTRIBUTED INFERENCE WITH MINIMAL OFF-CHIP TRAFFIC FOR TRANSFORMERS ON LOW-POWER MCUS
Speaker:
Victor Jung, ETH Zurich, CH
Authors:
Severin Bochem1, Victor Jung1, Arpan Suravi Prasad1, Francesco Conti2 and Luca Benini3
1ETH Zurich, CH; 2Università di Bologna, IT; 3ETH Zurich, CH | Università di Bologna, IT
Abstract
Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technological revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power microcontroller units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic exploiting a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy consumption of 0.64 mJ, a latency of 0.54 ms per inference, an above-linear speedup of 26.07×, and an energy-delay-product (EDP) improvement of 27.22×, compared to a single-chip system. On MobileBERT, the distributed system's runtime is 38.84 ms, with an above-linear 4.69× speedup when using 4 MCUs compared to a single-chip system.
08:50 CEST TS18.5 HIFI-SAGE: HIGH FIDELITY GRAPHSAGE-BASED LATENCY ESTIMATORS FOR DNN OPTIMIZATION
Speaker:
Shambhavi Balamuthu Sampath, BMW Group, DE
Authors:
Shambhavi Balamuthu Sampath1, Leon Hecht2, Moritz Thoma1, Lukas Frickenstein1, Pierpaolo Mori3, Nael Fasfous1, Manoj Rohit Vemparala1, Alexander Frickenstein1, Claudio Passerone3, Daniel Mueller-Gritschneder4 and Walter Stechele2
1BMW Group, DE; 2TU Munich, DE; 3Politecnico di Torino, IT; 4TU Wien, AT
Abstract
As deep neural networks (DNNs) are increasingly deployed on resource-constrained edge devices, optimizing and compressing them for real-time performance becomes crucial. Traditional hardware-aware DNN search methods often rely on inaccurate proxy metrics, expensive latency lookup tables, or slow hardware-in-the-loop (HIL) evaluations. To address this, quasi-generalized latency estimators, typically meta-learning-based, were proposed to replace HIL evaluations and accelerate the search. These come with a one-time data collection and training cost and can adapt to new hardware with few measurements. However, they still have some drawbacks: (1) they increase complexity by trying to generalize across a range of diverse hardware types; (2) they depend on handcrafted hardware descriptors, which may fail to capture hardware characteristics; (3) they often perform poorly on new, unseen hardware that differs significantly from their initial training set. To overcome these challenges, this paper turns to the more straightforward platform-specific estimators that do not require hardware descriptors and can be easily trained on any hardware. We introduce HiFi-SAGE, a high-fidelity GraphSAGE-based platform-specific latency estimator. When trained from scratch on only 100 latency measurements, our novel dual-head estimator design surpasses the state-of-the-art (SoTA) on the 10% error bound metric by up to 17.4 p.p. while achieving an impressive fidelity score of 99% on the diverse LatBench dataset. We demonstrate that applying HiFi-SAGE to a genetic algorithm-based DNN compression search achieves a Pareto front comparable to real HIL feedback, with a mean absolute percentage error (MAPE) of 2.54%, 2.48%, and 4.16% for InceptionV3, DenseNet169, and ResNet50, respectively. Compared to existing platform-specific works, the lower number of latency measurements and higher fidelity scores position HiFi-SAGE as an attractive alternative to replace expensive HIL setups. Code is available at: https://github.com/shamvbs/HiFi-SAGE.
08:55 CEST TS18.6 SOLARML: OPTIMIZING SENSING AND INFERENCE FOR SOLAR-POWERED TINYML PLATFORMS
Speaker:
Hao Liu, TU Delft, NL
Authors:
Hao Liu, Qing Wang and Marco Zuniga, TU Delft, NL
Abstract
Machine learning models can now run on microcontrollers. Thanks to advances in neural architecture search, we can automatically identify tiny machine learning (tinyML) models that satisfy stringent memory and energy requirements. However, existing methods often overlook the energy used during event detection and data gathering. This is critical for devices powered by renewable energy sources like solar power, where energy efficiency is paramount. To address this, we introduce SolarML, a solution designed specifically for solar-powered tinyML platforms, which optimizes the end-to-end system's inference accuracy and energy consumption, from data gathering and processing to model inference. Considering two applications, gesture recognition and keyword spotting, SolarML makes the following contributions: 1) a hardware platform with an optimal event detection mechanism that reduces event detection costs by up to 10× compared to state-of-the-art alternatives; 2) a joint optimization framework, eNAS, that reduces the energy consumption of the sensor and inference model by up to 2× compared to methods that only optimize the inference model. Together, they enable SolarML to run end-to-end gesture and audio inference on a battery-free tinyML platform by harvesting solar energy for only 30 and 57 seconds, respectively, in an office environment (500 lux).
09:00 CEST TS18.7 SAFELOC: OVERCOMING DATA POISONING ATTACKS IN HETEROGENEOUS FEDERATED MACHINE LEARNING FOR INDOOR LOCALIZATION
Speaker:
Akhil Singampalli, Colorado State University, US
Authors:
Akhil Singampalli, Danish Gufran and Sudeep Pasricha, Colorado State University, US
Abstract
Machine learning (ML) based indoor localization solutions are critical for many emerging applications, yet their efficacy is often compromised by hardware/software variations across mobile devices (i.e., device heterogeneity) and the threat of ML data poisoning attacks. Conventional methods aimed at countering these challenges show limited resilience to the uncertainties created by these phenomena. In response, we introduce SAFELOC, a novel framework that not only minimizes localization errors under these challenging conditions but also ensures model compactness for efficient mobile device deployment. SAFELOC introduces a novel fused neural network architecture that performs data poisoning detection and localization, with a low model footprint using federated learning (FL). Additionally, a dynamic saliency map-based aggregation strategy is designed to adapt based on the severity of the detected data poisoning scenario. Experimental evaluations demonstrate that SAFELOC achieves improvements of up to 5.9× in mean localization error, 7.8× in worst-case localization error, and a 2.1× reduction in model inference latency compared to state-of-the-art indoor localization frameworks across diverse indoor environments and data poisoning attack scenarios.
09:05 CEST TS18.8 HYBRID TOKEN SELECTOR BASED ACCELERATOR FOR VITS
Speaker:
Anadi Goyal, Indian Institute of Technology Jodhpur, IN
Authors:
Akshansh Yadav, Anadi Goyal and Palash Das, Indian Institute of Technology, Jodhpur, IN
Abstract
Vision Transformers (ViTs) have shown great success in computer vision but suffer from high computational complexity, which grows quadratically with the number of tokens processed. Token selection/pruning has emerged as a promising solution; however, early methods introduce significant overhead and complexity. Applying a token selector in the early layers of a ViT can yield substantial computational savings (GFLOPs) compared to using it in later layers. However, this approach often leads to significant accuracy loss, particularly with the popular Attention-based Token Selection (ATS) technique. To address these issues, we propose a hybrid token selection (HTS) strategy that integrates our Keypoint-based Token Selection (KTS) with the existing ATS method. KTS dynamically selects important tokens based on image content in the early layers, while ATS refines token pruning in the later layers. This hybrid approach reduces computational costs while maintaining accuracy. Additionally, we design custom hardware modules to accelerate the execution of the proposed methods and the ViT backbone. The proposed HTS delivers a 35.85% reduction in execution time relative to the baseline without any token selection. Furthermore, our results demonstrate that HTS achieves up to a 0.39% increase in accuracy and offers up to 6.05% savings in GFLOPs compared to existing methods.
09:10 CEST TS18.9 DAOP: DATA-AWARE OFFLOADING AND PREDICTIVE PRE-CALCULATION FOR EFFICIENT MOE INFERENCE
Speaker:
Yujie Zhang, National University of Singapore, SG
Authors:
Yujie Zhang, Shivam Aggarwal and Tulika Mitra, National University of Singapore, SG
Abstract
Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks, face significant deployment challenges on memory-constrained devices. While GPUs offer fast inference, their limited memory compared to CPUs means not all experts can be stored on the GPU simultaneously, necessitating frequent, costly data transfers from CPU memory, often negating GPU speed advantages. To address this, we present DAOP, an on-device MoE inference engine to optimize parallel GPU-CPU execution. DAOP dynamically allocates experts between CPU and GPU based on per-sequence activation patterns, and selectively pre-calculates predicted experts on CPUs to minimize transfer latency. This approach enables efficient resource utilization across various expert cache ratios while maintaining model accuracy through a novel graceful degradation mechanism. Comprehensive evaluations across various datasets show that DAOP outperforms traditional expert caching and prefetching methods by up to 8.20× and offloading techniques by 1.35× while maintaining accuracy.
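One simple way to picture the expert placement problem addressed here is a greedy allocation of the most frequently activated experts to the GPU under a memory budget, as in this sketch; the counts, sizes, and policy are assumptions, not DAOP's actual allocator.

```python
# Illustrative sketch: place the most frequently activated experts on the GPU
# under a memory budget and keep the rest on the CPU, so frequently used
# experts avoid CPU->GPU transfers. Counts and sizes are made-up assumptions.
def place_experts(activation_counts, expert_size_mb, gpu_budget_mb):
    order = sorted(activation_counts, key=activation_counts.get, reverse=True)
    placement, used = {}, 0.0
    for expert in order:
        if used + expert_size_mb <= gpu_budget_mb:
            placement[expert] = "gpu"
            used += expert_size_mb
        else:
            placement[expert] = "cpu"       # run (or pre-calculate) on the CPU instead
    return placement

counts = {"e0": 120, "e1": 15, "e2": 300, "e3": 48}   # per-sequence activations
print(place_experts(counts, expert_size_mb=400, gpu_budget_mb=1000))
# -> {'e2': 'gpu', 'e0': 'gpu', 'e3': 'cpu', 'e1': 'cpu'}
```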
09:15 CEST TS18.10 SPIKESTREAM: ACCELERATING SPIKING NEURAL NETWORK INFERENCE ON RISC-V CLUSTERS WITH SPARSE COMPUTATION EXTENSIONS
Speaker:
Simone Manoni, Università di Bologna, IT
Authors:
Simone Manoni1, Paul Scheffler2, Luca Zanatta3, Andrea Acquaviva1, Luca Benini4 and Andrea Bartolini1
1Università di Bologna, IT; 2ETH Zurich, CH; 3NTNU, NO; 4ETH Zurich, CH | Università di Bologna, IT
Abstract
Spiking Neural Network (SNN) inference has a clear potential for high energy efficiency as computation is triggered by events. However, the inherent sparsity of events poses challenges for conventional computing systems, driving the development of specialized neuromorphic processors, which come with high silicon area costs and lack the flexibility needed for running other computational kernels, limiting widespread adoption. In this paper, we explore the low-level software design, parallelization, and acceleration of SNNs on general-purpose multicore clusters with a low-overhead RISC-V ISA extension for streaming sparse computations. We propose SpikeStream, an optimization technique that maps weight accesses to affine and indirect register-mapped memory streams to enhance performance, utilization, and efficiency. Our results on the end-to-end Spiking-VGG11 model demonstrate a significant 4.39× speedup and an increase in utilization from 9.28% to 52.3% compared to a non-streaming parallel baseline. Additionally, we achieve an energy efficiency gain of 3.46× over LSMCore and a performance gain of 2.38× over Loihi.
09:20 CEST TS18.11 REACT: RANDOMIZED ENCRYPTION WITH AI-CONTROLLED TARGETING FOR NEXT-GEN SECURE COMMUNICATION
Speaker:
Hossein Sayadi, California State University, Long Beach, US
Authors:
Zhangying He and Hossein Sayadi, California State University, Long Beach, US
Abstract
This work introduces REACT (Randomized Encryption with AI-Controlled Targeting), a novel framework leveraging Deep Reinforcement Learning (DRL) and Moving Target Defense (MTD) to secure chaotic communication in resource-constrained environments. REACT employs a random generator to dynamically assign encryption modes, creating unpredictable patterns that thwart interception. At the receiver's end, four DRL agents collaborate to identify encryption modes and apply decryption methods, ensuring secure, synchronized communication. Evaluation results demonstrate up to 100% decryption accuracy and a 51% reduction in attack success probability, establishing REACT as a robust and adaptive defense for secure and reliable communication.
09:21 CEST TS18.12 DUSGAI: A DUAL-SIDE SPARSE GEMM ACCELERATOR WITH FLEXIBLE INTERCONNECTS
Speaker:
Wujie Zhong, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Wujie Zhong and Yangdi Lyu, The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Sparse general matrix multiplication (SpGEMM) is a crucial operation of deep neural networks (DNNs), leading to the development of numerous specialized SpGEMM accelerators. These accelerators leverage flexible interconnects, thereby outperforming their rigid counterparts. However, the suboptimal utilization of sparsity patterns limits overall performance efficiency. In this work, we propose DuSGAI, a sparse GEMM accelerator that employs a parallel index intersection structure to utilize dual-side sparsity. Our evaluation of DuSGAI with five popular DNN models demonstrates a 3.03× performance improvement compared to the state-of-the-art SpGEMM accelerator.

TS19 Design and test of secure systems

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST TS19.1 DE2: SAT-BASED SEQUENTIAL LOGIC DECRYPTION WITH A FUNCTIONAL DESCRIPTION
Speaker:
Hai Zhou, Northwestern University, US
Authors:
You Li, Guannan Zhao, Yunqi He and Hai Zhou, Northwestern University, US
Abstract
Logic locking is a promising approach to protect the intellectual properties of integrated circuits. Existing logic locking schemes assume that an adversary must possess a cycle-accurate oracle circuit to launch an I/O attack. This paper presents DE2, a novel and rigorous attacking algorithm based on a new adversarial model. DE2 only takes a high-level functional specification of the victim chip. Such specifications are increasingly prevalent in the modern IC design flow. DE2 closes the timing gap between the specification and the circuit with an automatic alignment mechanism, which enables effective logic decryption without cycle-accurate information. An essential enabler of DE2 is a synthesis-based sequential logic decryption algorithm called LIM, which introduces only a minimal overhead in every iteration. Experiments show that DE2 can efficiently attack logic-locked benchmarks without access to a cycle-accurate oracle circuit. Besides, LIM can solve 20% more ISCAS'89 benchmarks than state-of-the-art sequential logic decryption algorithms.
08:35 CEST TS19.2 HARDWARE/SOFTWARE RUNTIME FOR GPSA PROTECTION IN RISC-V EMBEDDED CORES
Speaker:
Louis Savary, INRIA, FR
Authors:
Louis Savary1, Simon Rokicki2 and Steven Derrien3
1INRIA, FR; 2IRISA, FR; 3Université de Bretagne Occidentale | Lab-STICC, FR
Abstract
State-of-the-art hardware countermeasures against fault attacks are based, among others, on control flow and code integrity checking. Generalized Path Signature Analysis and Continuous Signature Monitoring can assert these integrity properties. However, existing support for such mechanisms requires a dedicated compiler flow and cannot handle indirect jumps. This work proposes a technique based on a hardware/software runtime that generates those signatures while executing unmodified off-the-shelf RISC-V binaries. The proposed approach has been implemented on a pipelined processor, and experimental results show an average slowdown of 3× compared to unprotected implementations while being completely compiler-independent.
08:40 CEST TS19.3 ANALOG CIRCUIT ANTI-PIRACY SECURITY BY EXPLOITING DEVICE RATINGS
Speaker:
Hazem Hammam, Sorbonne Université, CNRS, LIP6, FR
Authors:
Hazem Hammam1, Hassan Aboushady1 and Haralampos-G. Stratigopoulos2
1Sorbonne Université, CNRS, LIP6, FR; 2Sorbonne University, CNRS, LIP6, FR
Abstract
We propose a novel anti-piracy security technique for analog and mixed-signal (AMS) circuits. The circuit is re-designed by obfuscating transistors and capacitors with key-controlled versions. We obfuscate both the device geometries and their ratings, which define the maximum allowable current, voltage, and power dissipation. The circuit is designed to function correctly only with a specific key. Loading any other incorrect key degrades performance and for the vast majority of these keys the chip is damaged because of electrical over-stress. This prevents counter-attacks that employ a chip to search for the correct key. The methodology is demonstrated on a low-dropout regulator (LDO) designed in the 22nm FDSOI technology by GlobalFoundries. By locking the LDO, the entire chip functionality breaks unless the LDO is unlocked first. The secured LDO shows no performance penalty and area overhead is justifiable and less than 25%, while it is protected against all known counter-attacks in the AMS domain.
08:45 CEST TS19.4 SIDE-CHANNEL COLLISION ATTACKS AGAINST ASCON
Speaker:
Hao Zhang, Nanjing University of Science and Technology, CN
Authors:
Hao Zhang, Yiwen Gao, Yongbin Zhou and Jingdian Ming, Nanjing University of Science and Technology, CN
Abstract
Side-channel attacks pose a significant threat to the security of electronic devices, particularly IoT/AIoT terminals. By leveraging side-channel leakages, collision attacks can efficiently extract secret keys from cryptographic devices while requiring considerably less computational effort. In this paper, we investigate side-channel collision attacks against ASCON, a lightweight cipher designed for resource-constrained devices, which has been standardized by NIST. For the first time, we propose a side-channel key recovery attack against ASCON by identifying collisions in the linear diffusion layer. Using the Pearson correlation coefficient and Euclidean distance for internal collision detection, our attack successfully recovers the secret key with approximately 5,000 power traces from an 8-bit software implementation on an AVR device. To further reduce attack complexity, we introduce a novel metric, Locally-Weighted Sum (LWS), which focuses on the most likely points of leakage, thereby decreasing the number of power traces required for a successful attack. Our experiment on the same target demonstrates that the LWS-based collision attack can recover the full secret key with approximately 3,000 power traces, a reduction of 40 percent. Our study indicates that ASCON is susceptible to side-channel collision attacks, and that bitslice implementations remain vulnerable to such threats.
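Conceptually, an internal collision test of the kind mentioned above compares the leakage of two groups of traces, for example via the Pearson correlation of their mean traces, as in this synthetic sketch (not the paper's attack; thresholds and traces are assumptions).

```python
# Illustrative sketch: decide whether two groups of power traces correspond to
# a colliding intermediate value by comparing their mean traces with the
# Pearson correlation and the Euclidean distance. Data here is synthetic.
import numpy as np

def detect_collision(traces_a, traces_b, corr_thresh=0.9):
    mean_a = traces_a.mean(axis=0)
    mean_b = traces_b.mean(axis=0)
    corr = np.corrcoef(mean_a, mean_b)[0, 1]     # Pearson correlation
    dist = np.linalg.norm(mean_a - mean_b)       # Euclidean distance
    return corr > corr_thresh, corr, dist

rng = np.random.default_rng(0)
leak = rng.normal(size=200)                       # shared leakage shape
a = leak + 0.1 * rng.normal(size=(50, 200))       # colliding group
b = leak + 0.1 * rng.normal(size=(50, 200))       # colliding group
c = rng.normal(size=(50, 200))                    # non-colliding group
print(detect_collision(a, b))   # high correlation -> collision
print(detect_collision(a, c))   # low correlation  -> no collision
```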
08:50 CEST TS19.5 CUTE-LOCK: BEHAVIORAL AND STRUCTURAL MULTI-KEY LOGIC LOCKING USING TIME BASE KEYS
Speaker:
Amin Rezaei, California State University, Long Beach, US
Authors:
Kevin Lopez and Amin Rezaei, California State University, Long Beach, US
Abstract
The outsourcing of semiconductor manufacturing raises security risks, such as piracy and overproduction of hardware intellectual property. To overcome this challenge, logic locking has emerged to lock a given circuit using additional key bits. While single-key logic locking approaches have demonstrated serious vulnerability to a wide range of attacks, multi-key solutions, if carefully designed, can provide a reliable defense against not only oracle-guided logic attacks, but also removal and dataflow attacks. In this paper, using time-based keys, we propose, implement and evaluate a family of secure multi-key logic locking algorithms called Cute-Lock that can be applied to both RTL-level behavioral and netlist-level structural representations of sequential circuits. Our extensive experimental results under a diverse range of attacks confirm that, compared to vulnerable state-of-the-art methods, employing the Cute-Lock family drives attacking attempts to a dead end without additional overhead.
08:55 CEST TS19.6 SAFELIGHT: ENHANCING SECURITY IN OPTICAL CONVOLUTIONAL NEURAL NETWORK ACCELERATORS
Speaker:
Salma Afifi, Colorado State University, US
Authors:
Salma Afifi1, Ishan Thakkar2 and Sudeep Pasricha1
1Colorado State University, US; 2University of Kentucky, US
Abstract
The rapid proliferation of deep learning has revolutionized computing hardware, driving innovations to improve computationally expensive multiply-accumulate operations in deep neural networks. Among these innovations are integrated silicon-photonic systems that have emerged as energy-efficient platforms capable of achieving light speed computation and communication, positioning optical neural network (ONN) platforms as a transformative technology for accelerating deep learning models such as convolutional neural networks (CNNs). However, the increasing complexity of optical hardware introduces new vulnerabilities, notably the risk of hardware trojan (HT) attacks. Despite the growing interest in ONN platforms, little attention has been given to how HT-induced threats can compromise performance and security. This paper presents an in-depth analysis of the impact of such attacks on the performance of CNN models accelerated by ONN accelerators. Specifically, we show how HTs can compromise microring resonators (MRs) in a state-of-the-art non-coherent ONN accelerator and reduce classification accuracy across CNN models by 7.49% to 80.46% while targeting just 10% of MRs. We then propose techniques to enhance ONN accelerator robustness against these attacks and show how the best of them can effectively recover the lost accuracy.
09:00 CEST TS19.7 ONE MORE MOTIVATION TO USE EVALUATION TOOLS, THIS TIME FOR HARDWARE MULTIPLICATIVE MASKING OF AES
Speaker:
Hemin Rahimi, TU Darmstadt, DE
Authors:
Hemin Rahimi and Amir Moradi, TU Darmstadt, DE
Abstract
Safeguarding cryptographic implementations against the increasing threat of Side-Channel Analysis (SCA) attacks is essential. Masking, a countermeasure that randomizes intermediate values, is a cornerstone of such defenses. In particular, an SCA-secure implementation of AES, the most widely used encryption standard, can employ Boolean masking as well as multiplicative masking due to its underlying Galois field operations. However, multiplicative masking is susceptible to vulnerabilities, including the zero-value problem, which was identified soon after multiplicative masking was introduced. At CHES 2018, De Meyer et al. proposed a hardware-based approach to manage these challenges and implemented multiplicative masking for AES, incorporating a Kronecker delta function and randomness optimization. In this work, we evaluate their design using the PROLEAD evaluation tool under the glitch- and transition-extended probing model. Our findings reveal a critical vulnerability in their first-order implementation of the Kronecker delta function, stemming from the employed randomness optimization. This leakage compromises the security of their masked AES Sbox. After pinpointing the source of this leakage, we propose an alternative randomness optimization to address the issue and demonstrate its effectiveness through rigorous evaluations with PROLEAD.
09:05 CEST TS19.8 THREE EYED RAVEN: AN ON-CHIP SIDE CHANNEL ANALYSIS FRAMEWORK FOR RUN-TIME EVALUATION
Speaker:
M Dhilipkumar, IIT Kanpur, IN
Authors:
M Dhilipkumar, Priyanka Bagade and Debapriya Basu Roy, IIT Kanpur, IN
Abstract
Side-channel attacks exploit physical leakages from hardware components, such as power consumption, to break secure cryptographic algorithms and retrieve their secret keys. Therefore, evaluating implementations of cryptographic algorithms against such analysis is of paramount importance. A typical side-channel evaluation framework requires external devices such as a sampling oscilloscope along with a customized analysis board, which makes the evaluation both expensive and time-consuming. However, recent advancements in developing on-chip sensors on FPGAs for monitoring side-channel information pave the way towards a fully on-chip side-channel analysis framework without the requirement of any external devices, reducing both the cost and the time required to carry out these experiments. In this paper, we propose our on-chip side-channel analysis framework RAVEN, which is augmented with hardware implementations of Test Vector Leakage Assessment (TVLA), Correlation Power Analysis (CPA), and Deep Learning based Leakage Assessment (DL-LA). The on-chip hardware implementations of these side-channel evaluation algorithms, coupled with on-chip sensors, allow RAVEN to assess the side-channel security of the crypto implementation in a fast and efficient manner. Our proposed implementation of DL-LA can also be trained on-chip and does not require pre-trained weight values. RAVEN's resource consumption is modest, as the entire design along with the sensors fits into an AMD-Xilinx PYNQ board. We have validated the proposed RAVEN framework on AES-128 traces, and the results of the hardware implementations of TVLA, CPA, and DL-LA closely resemble those of the software implementations while requiring significantly less time and storage.
09:10 CEST TS19.9 RTL-BREAKER: ASSESSING THE SECURITY OF LLMS AGAINST BACKDOOR ATTACKS ON HDL CODE GENERATION
Speaker:
Lakshmi Likhitha Mankali, New York University, US
Authors:
Lakshmi Likhitha Mankali1, Jitendra Bhandari1, Manaar Alam2, Ramesh Karri1, Michail Maniatakos2, Ozgur Sinanoglu2 and Johann Knechtel2
1New York University, US; 2New York University Abu Dhabi, AE
Abstract
Large language models (LLMs) have demonstrated remarkable potential for code generation/completion tasks in hardware design. However, the reliance on such automation introduces critical security risks. Notably, given that LLMs have to be trained on vast datasets of code that are typically sourced from publicly available repositories, often without thorough validation, LLMs are susceptible to so-called data poisoning or backdoor attacks. Here, attackers inject malicious code into the training data, which can be carried over into the hardware description language (HDL) code generated by LLMs. This threat vector can compromise the security and integrity of entire hardware systems. In this work, we propose RTL-Breaker, a novel backdoor attack framework for LLM-based HDL code generation. RTL-Breaker provides an in-depth analysis of essential aspects of this novel problem: 1) various trigger mechanisms and their effectiveness for inserting malicious modifications, and 2) the side effects of backdoor attacks on code generation in general, i.e., the impact on code quality. RTL-Breaker emphasizes the urgent need for more robust measures to safeguard against such attacks. Toward that end, we open-source our framework and all data.
09:15 CEST TS19.10 MC3: MEMORY CONTENTION-BASED COVERT CHANNEL COMMUNICATION ON SHARED DRAM SYSTEM-ON-CHIPS
Speaker:
Ismet Dagli, Colorado School of Mines, US
Authors:
Ismet Dagli1, James Crea1, Soner Seckiner2, Yuanchao Xu3, Selcuk Kose2 and Mehmet Belviranli1
1Colorado School of Mines, US; 2University of Rochester, US; 3University of California, Santa Cruz, US
Abstract
Shared memory system-on-chips (SM-SoCs) are ubiquitously employed by a wide range of computing platforms, including edge/IoT devices, autonomous systems, and smartphones. In SM-SoCs, system-wide shared memory enables a convenient and financially feasible way to make data accessible across dozens of processing units (PUs), such as CPU cores and domain-specific accelerators. Due to the diverse computational characteristics of the PUs they embed, SM-SoCs often do not employ a shared last-level cache (LLC). While the literature studies covert channel attacks for shared memory systems, high-throughput communication is currently possible only through either relying on an LLC or having privileged/physical access to the shared memory subsystem. In this study, we introduce a new memory-contention-based covert communication attack, MC3, which specifically targets shared system memory in mobile SoCs. Unlike existing attacks, our approach achieves high-throughput communication without the need for an LLC or elevated access to the system. We explore the effectiveness of our methodology by demonstrating the trade-off between the channel transmission rate and the robustness of the communication. We evaluate MC3 on NVIDIA Orin AGX, NX, and Nano platforms and achieve transmission rates of up to 6.4 Kbps with less than 1% error rate.
09:20 CEST TS19.11 COMB FREQUENCY DIVISION MULTIPLEXING: A NON-BINARY MODULATION FOR AIRGAP COVERT CHANNEL TRANSMISSION
Speaker:
Mohamed-alla-eddine BAHI, Univ Rennes, INSA Rennes, CNRS, IETR - UMR 6164, F-35000 Rennes, France, FR
Authors:
Mohamed-alla-eddine BAHI1, Maria MENDEZ REAL2 and Maxime PELCAT2
1Univ Rennes, INSA Rennes, IETR, UMR CNRS 6164, FR; 2IETR - UMR CNRS 6164, FR
Abstract
Isolated networks ensure the confidentiality of sensitive data on a system by eliminating all physical connections to public networks or external devices, making the system air-gapped. However, previous work has shown that Electromagnetic (EM) emanations, when correlated with secret data, can lead to side or covert channels. Specifically, EM emissions caused by clocks can modulate high-frequency signals, enabling unauthorized data transmission to cross the air gap. This work focuses on covert channels where a software or hardware Trojan inserted in the victim system induces side-channel emissions that the attacker can recover through the covert channel, producing an intentional transmission and leakage of sensitive information. This paper introduces a novel encoding method for covert channels called Comb Frequency Division Multiplexing (CFDM). CFDM leverages modulated signals emitted by the victim system, which are evenly spaced across the frequency spectrum, creating a comb-like pattern. Moreover, the uncontrolled nature of the side-channel modulation can make each subcarrier carry different information. Unlike traditional methods such as Frequency Shift Keying (FSK) and Amplitude Shift Keying (ASK), CFDM encodes information in both the frequency and amplitude dimensions of the covert channel harmonic sub-carriers.
09:21 CEST TS19.12 MULTI-SENSOR DATA FUSION FOR ENHANCED DETECTION OF LASER FAULT INJECTION ATTACKS IN CRYPTOGRAPHIC HARDWARE: PRACTICAL RESULTS
Speaker:
Naghmeh Karimi, University of Maryland Baltimore County, US
Authors:
Mohammad Ebrahimabadi1, Raphael Viera2, Sylvain Guilley3, Jean Luc Danger4, Jean-Max Dutertre5 and Naghmeh Karimi1
1University of Maryland Baltimore County, US; 2Ecole de Mines de Saint-Etienne, FR; 3Secure-IC, FR; 4Télécom ParisTech, FR; 5Mines Saint-Etienne, FR
Abstract
Though considered secure, cryptographic hardware can be compromised by adversaries injecting faults during runtime to leak secret keys from faulty outputs. Among fault injection methods, laser illumination has gained the most attention due to its precision in targeting specific areas and its fine temporal control. Accordingly, to tackle such attacks, this paper proposes a low-cost detection scheme that leverages Time-To-Digital Converters (TDC) to sense the IR drops caused by laser illumination. To mitigate the false alarm rate while maintaining a high detection rate, our method embeds multiple sensors (as few as two, as discussed in the text). To evaluate the impact of laser illumination and the effectiveness of our proposed scheme, we conducted extensive experiments (≈200k) using a real laser setup to illuminate the targeted AES module implemented on an Artix-7 FPGA. The results confirm the high accuracy of our detection method; achieving 82% fault detection with less than 0.01% false alarms and a detection latency of just 4 clock cycles. Notably, it enabled preventive actions in 70% of cases where illumination occurred but the AES outcome had not changed, greatly enhancing circuit security against key leakage.

W07 Designing Sustainable Intelligent Systems: Integrating Carbon Footprint Reduction, TinyML, and RISC-V

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 12:30 CEST


FS09 Focus Session

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST


MPP02 Multi-Partner Projects

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST MPP02.1 MULTI-PARTNER PROJECT: SECURING FUTURE EDGE-AI PROCESSORS IN PRACTICE (CONVOLVE)
Speaker:
Sven Argo, Ruhr-University Bochum, DE
Authors:
Sven Argo1, Henk Corporaal2, Alejandro Garza3, Manil Dev Gomony4, Tim Güneysu1, Adrian Marotzke3, Fouwad Mir5, Jan Richter-Brockmann1, Jeffrey Smith2 and Mottaqiallah Taouil6
1Ruhr University Bochum, DE; 2Eindhoven University of Technology, NL; 3NXP Semiconductors, DE; 4Eindhoven University of Technology, NL; 5Delft University of Technology (TU Delft), NL; 6TU Delft, NL
Abstract
Artificial Intelligence (AI) has had a profound impact on our contemporary society, and it is indisputable that it will continue to play a significant role in the future. To further enhance AI experience and performance, a transition from large-scale server applications towards AI-powered edge devices is inevitable. In fact, current projections indicate that the market for Smart Edge Processors (SEPs) will grow beyond 70 Billion USD by 2026 [1]. Such a shift comes with major challenges, as these devices have limited computing and energy resources yet need to be highly performant. Additionally, security mechanisms need to be implemented to protect against diverse attack vectors as attackers now have physical access to the device. Besides cryptographic keys, Intellectual Property (IP), including neural network weights, may also be potential targets. The CONVOLVE [2] project (currently in its intermediate stage) follows a holistic approach to address these challenges and establish the EU in a leading position in embedded, ultra-low-power and secure processors for edge computing. It encompasses novel hardware technologies, end-to-end integrated workflows, and a security-by-design approach. This paper highlights the security aspects of future edge-AI processors by illustrating challenges encountered in CONVOLVE, the solutions we pursue including some early results, and directions for future research.
11:05 CEST MPP02.2 MULTI-PARTNER PROJECT: OPEN-SOURCE DESIGN TOOLS FOR CO-DEVELOPMENT OF AI ALGORITHMS AND AI CHIPS
Speaker:
Mehdi Tahoori, Karlsruhe Institute of Technology, DE
Authors:
Mehdi Tahoori1, Joerg Henkel1, Jürgen Teich2, Juergen Becker1, Ulf Schlichtmann3, Norbert Wehn4, Georg Sigl5 and Wolfgang Kunz4
1Karlsruhe Institute of Technology, DE; 2Friedrich-Alexander-Universität Erlangen-Nürnberg, DE; 3TU Munich, DE; 4University of Kaiserslautern-Landau, DE; 5TU Munich/Fraunhofer AISEC, DE
Abstract
Chip technologies are crucial for the digital transformation of industry and society. Artificial Intelligence (AI) is playing an increasingly important role in both our daily lives and in industry. The development of advanced AI chip designs, essential for the successful deployment of AI, is of critical importance for innovation and competitiveness. However, challenges arise from the complexity of hardware development, expensive access to state-of-the-art design tools, and a global shortage of hardware experts. In addition to cost optimization, computational power, and energy consumption, security and trustworthiness are becoming increasingly important. This project aims to address these challenges in AI chip design by enabling efficient hardware development. We are developing a seamless transition between software-based AI model development and optimization, and efficient hardware implementation, while considering security, trustworthiness, and energy efficiency. An open-source approach plays a key role, facilitating access for small and medium-sized enterprises (SMEs) and expanding the community involved in AI chip design to help mitigate the shortage of skilled professionals.
11:10 CEST MPP02.3 MULTI-PARTNER PROJECT: SUSTAINABLE TEXTILE ELECTRONICS (STELEC)
Speaker:
Bo Zhou, German Research Centre for Artificial Intelligence (DFKI), DE
Authors:
Bo Zhou1, Mengxi Liu1, Sizhen Bian1, Daniel Geißler1, Paul Lukowicz1, Jose Miranda2, Jonathan Dan3, David Atienza3, Mohamed Riahi4, Norbert Wehn4, Russel Torah5, Sheng Yong5, Jidong Liu5, Stephen Beeby5, Magdalena Kohler6, Berit Greinke6, Junchun Yu7, Vincent Nierstrasz7, Leila Sheldrick8, Rebecca Stewart8, Tommaso Nieri9, Matteo Maccanti9 and Daniele Spinelli9
1DFKI, DE; 2EPFL, CH; 3EPFL, CH; 4RPTU, DE; 5University of Southampton, GB; 6UDK, DE; 7University of Borås, SE; 8Imperial College London, GB; 9Next Technology Tecnotessile, IT
Abstract
E-textiles are rapidly emerging as an important area of electronic circuit applications. They also enable many socially important applications such as personalized health, elderly care, and smart agriculture. However, the environmental impact and sustainability of e-textiles remain very problematic. STELEC, short for Sustainable Textile ELECtronics, is an interdisciplinary research project funded by the European Innovation Council (EIC) under the Pathfinder programme on the responsible electronics topic, seeking cutting-edge innovation. STELEC started in September 2024 and is in its initial stage. The project is a multinational collaboration of research institutes, universities and companies across Europe. It aims at developing next-generation textile-based electronics in applications from sensing and processing to AI, with a commitment to full lifecycle sustainability.
11:15 CEST MPP02.4 MULTI-PARTNER PROJECT: TWINNING FOR EXCELLENCE IN RELIABLE ELECTRONICS (TWIN-RELECT)
Speaker:
Marko Andjelkovic, Leibniz-Institut für innovative Mikroelektronik, DE
Authors:
Marko Andjelkovic1, Fabian Vargas1, Milos Krstic1, Luigi Dilillo2, Alain Michez3, Frederic Wrobel3, Davide Bertozzi4, Mikel Lujan4, Christos Georgakidis5, Keterina Tsilingiri5, Nikolaos Chatzivangelis5, Nikolaos Zazatis5, Giorgos-Ioanis Pagliaroutis5, Pelopidas Tsoumanis5 and Christos Sotiriou5
1Leibniz-Institut für innovative Mikroelektronik, DE; 2CNRS, FR; 3Université de Montpellier, FR; 4The University of Manchester, GB; 5University of Thessaly, GR
Abstract
Reliable electronics plays a major role in shaping our daily lives, being a key enabler for critical applications such as space missions, avionics, automotive, medicine, banking, automated industry, and wireless communication networks. However, the design of highly reliable electronic systems remains a challenge with the advances in semiconductor technology and the increase in integrated circuit (IC) complexity. In this work, we introduce the Horizon Europe Twinning project TWIN-RELECT, aimed at strengthening the scientific expertise in designing reliable integrated circuits. The paper presents the general project concept and objectives, and the main directions of the joint research activities. The primary scientific goal is to contribute to the development of a novel, more efficient European Electronic Design Automation (EDA) tool-chain for the design of reliable chips.
11:20 CEST MPP02.5 MULTI-PARTNER PROJECT: LOLIPOP-IOT – DESIGN AND SIMULATION OF ENERGY-EFFICIENT DEVICES FOR THE INTERNET OF THINGS
Speaker:
Jakub Lojda, Brno University of Technology, CZ
Authors:
Jakub Lojda, Josef Strnadel, Pavel Smrz and Vaclav Simek, Brno University of Technology, CZ
Abstract
This paper presents an overview of the Internet of Things (IoT) device design and simulation, with a specific focus on low-power design principles – everything in the context of the LoLiPoP-IoT project. The project aims to enhance IoT device usability by reducing maintenance requirements related to battery recharging or replacement. Another key goal is to significantly decrease the massive waste generated by discarded primary batteries, contributing to more sustainable and user-friendly IoT solutions for the future. The primary focus of this paper is on a custom IoT localization tag, for which we simulate solar cells – ranging from basic modeling to their integration into electrical circuits – and the power consumption of the tag's electronics platform. The analyzed sample platform is built on the nRF52833 microcontroller and the DW3110 ultra-wideband transceiver. We also applied our experimental framework principles to optimize power consumption and extend battery life. Reductions in photovoltaic panel area were achieved for both devices with a 5-year lifespan and fully autonomous tags, though with increased localization latency. Furthermore, this paper demonstrates how IoT devices, including their firmware, can be effectively modeled and simulated using publicly available tools.
11:25 CEST MPP02.6 MULTI-PARTNER PROJECT: CONTRIBUTING TO TRUSTED CHIP DESIGN USING REVERSE ENGINEERING METHODS (RESEC)
Speaker:
Johanna Baehr, Fraunhofer AISEC, DE
Authors:
Bernhard Lippmann1, Johanna Baehr2, Horst Gieser3 and Alexander Hepp2
1Infineon Technologies, DE; 2TU Munich, DE; 3Fraunhofer EMFT, DE
Abstract
The RESEC (REconstruction of highly integrated SECurity devices) project addresses the growing concerns of malicious modification and IP piracy in globally distributed supply chains. The project's primary objective is to develop, verify, and optimize a complete reverse engineering process for integrated circuits manufactured in technology nodes of 40 nm and below. This paper highlights the significant contributions of RESEC in the areas of sample preparation, computer vision, and netlist analysis, thereby extending the state of the art in reverse engineering. The project's outcomes are expected to have a profound impact on the development and physical verification of trusted chips, paving the way for future research.
11:30 CEST MPP02.7 MULTI-PARTNER PROJECT: SMART SENSOR ANALOG FRONT-ENDS POWERED BY EMERGING RECONFIGURABLE DEVICES (SENSOTERIC)
Speaker:
Jens Trommer, NaMLab gGmbH, DE
Authors:
Giulio Galderisi1, Andreas Kramer2, Andreas Fuchsberger3, Jose Maria Gonzalez-Medina4, Yuxuan He1, Lee-Chi Hung4, Merrit Jen Hong Li5, Julian Kulenkampff2, Maximilian Reuter2, Lukas Wind3, Masiar Sistani3, Thomas Mikolajick1, Bruno Neckel-Wesling6, Marina Deng6, Cristell Maneux6, Pieter Harpe5, Sonia Prado Lopez3, Oskar Baumgartner4, Chhandak Mukherjee6, Eugenio Cantatore5, Sandro Carrara7, Klaus Hofmann2, Walter Weber3 and Jens Trommer1
1NaMLab gGmbH, DE; 2TU Darmstadt, DE; 3TU Vienna, AT; 4Global TCAD Solutions GmbH, AT; 5Eindhoven University of Technology, NL; 6University of Bordeaux, FR; 7EPFL, CH
Abstract
This work introduces SENSOTERIC, a HORIZON EU multi-partner project that aims at leveraging the properties of emerging Reconfigurable Field Effect Transistors (RFETs) to develop a sensor platform. RFETs will be used for a generic sensor interface and for a dedicated transducer element. In the first case, our goal is to develop an analog front-end interface that can be tuned at runtime to adapt to different environmental conditions and be used in a broad spectrum of applications. This feature shall be enabled by the polarity-control and negative differential resistance characteristics of the reconfigurable devices employed, which are co-integrable with industrial CMOS processes such as 22 nm FDSOI. In the second case, we want to exploit the intrinsic nature of these doping-free devices to yield better 1/f noise performance when compared to classic CMOS transducers. Moreover, the presence of un-gated areas on top of the channel of these devices makes them ideal candidates for functionalization. In this early-stage overview of the project, we introduce the key features and the vision that make SENSOTERIC a unique contribution towards smart sensing solutions in environmental monitoring and healthcare.
11:31 CEST MPP02.8 MULTI-PARTNER PROJECT: SECURE HARDWARE ACCELERATED DATA ANALYTICS FOR 6G NETWORKS: THE PRIVATEER APPROACH
Speaker:
Ilias Papalamprou, National TU Athens, GR
Authors:
Ilias Papalamprou1, Aimilios Leftheriotis2, Apostolos Garos3, Georgios Gardikis4, Maria Christopoulou5, George Xilouris5, Lampros Argyriou6, Antonia Karamatskou7, Nikolaos Papadakis6, Emmanouil Kalotychos8, Nikolaos Chatzivasileiadis8, Dimosthenis Masouros1 and Dimitrios Soudris1
1National TU Athens, GR; 2University of Patras, GR; 3R&D Department, Space Hellas S.A., GR; 4R&D Department, Space Hellas S.A., GR; 5Institute of Informatics and Telecommunications, NCSR "Demokritos", GR; 6Infili Technologies S.A., GR; 7Infili Technologies S.A., GR; 8UBITECH Ltd., Digital Security & Trusted Computing Group, GR
Abstract
Next generation 6G networks are designed to meet the requirements of modern applications, including the need for higher bandwidth and ultra-low latency services. While these networks show significant potential to fulfill these evolving connectivity needs, they also bring new challenges, particularly in the area of security. Meanwhile, ensuring privacy is paramount in 6G network development, demanding robust solutions that follow "privacy-by-design" principles. To address these challenges, the PRIVATEER project strengthens existing security mechanisms by introducing privacy-centric enablers tailored for 6G networks. This work evaluates key enablers within PRIVATEER, focusing on the development and acceleration of AI-driven anomaly detection models, as well as attestation mechanisms for both hardware accelerators and containerized applications.

SD04 Special Day on Emerging Computing Paradigms

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Session chair:
John Paul Strachan, Forschungszentrum Juelich GmbH, DE

Time Label Presentation Title
Authors
11:00 CEST SD04.1 HOW TO BUILD QUANTUM COMPUTERS AND HOW TO USE THEM
Presenter:
Tommaso Calarco, University of Cologne, DE
Author:
Tommaso Calarco, University of Cologne, DE
Abstract
.
11:22 CEST SD04.2 CELLULAR AND DEVELOPMENTAL PATHWAYS TO MACHINE INTELLIGENCE
Presenter:
Sebastian Risi, IT Copenhagen, DK
Author:
Sebastian Risi, IT Copenhagen, DK
Abstract
.
11:45 CEST SD04.3 TOWARDS SCALABLE PROBABILISTIC COMPUTERS FOR BINARY OPTIMIZATION AND BEYOND
Presenter:
Corentin Delacour, University of California, Santa Barbara, US
Author:
Corentin Delacour, University of California, Santa Barbara, US
Abstract
.
12:07 CEST SD04.4 NEUROMORPHIC COMPUTING AT CLOUD LEVEL
Presenter:
Christian Mayr, TU Dresden, DE
Author:
Christian Mayr, TU Dresden, DE
Abstract
.

TS20 Physical analysis and design

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS20.1 MEGAROUTE: UNIVERSAL AUTOMATED LARGE-SCALE PCB ROUTING METHOD WITH ADAPTIVE STEP-SIZE SEARCH
Speaker:
Haiyun Li, Tsinghua University, CN
Authors:
Haiyun Li1 and Jixin Zhang2
1School of Computer Science, Hubei University of Technology, Wuhan, China; Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, CN; 2Hubei University of Technology, CN
Abstract
The automation of very large-scale PCB routing has long been an unresolved problem in industry due to the wide variety of electronic components and complex design rules. Existing automated PCB routing methods are primarily designed for a single component type (e.g., BGA, BTB) or for simple, small-scale PCBs, and often fail to meet the industry requirements for large-scale PCBs. The biggest challenge is to ensure nearly 100% routability and DRC compliance while achieving high efficiency for large-scale PCBs with various components. To address this challenge, we propose MegaRoute, a precise, efficient, and universal PCB routing method that surpasses the routability and DRC compliance of existing methods, including commercial tools, for PCBs with thousands of nets. MegaRoute introduces an adaptive step-size search algorithm that adjusts exploration steps based on design rules and surrounding obstacles, improving both routability and efficiency. We incorporate shape-based obstacle detection for strict DRC compliance and use routing optimization techniques to enhance routability. We conduct extensive experiments on hundreds of real-world PCBs, including mainboard PCBs with thousands of nets. The results show that MegaRoute achieves over 98% routability across all PCBs with DRC-free results, significantly outperforming state-of-the-art methods and mainstream commercial tools.
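The abstract does not disclose MegaRoute's exact step-size policy; the Python sketch below only illustrates the general idea of an adaptive step: advance coarsely through open regions and fall back to the design-rule granularity near obstacles. The specific rule and the parameter names are assumptions for illustration.

# Illustrative adaptive step-size selection for grid-based routing search.
# The clearance-proportional rule here is an assumption, not MegaRoute's policy.

def adaptive_step(dist_to_obstacle: float, min_spacing: float, max_step: float) -> float:
    """Take large steps in open regions, fine steps when close to obstacles."""
    # Never step so far that a spacing violation could be skipped over.
    step = dist_to_obstacle - min_spacing
    return max(min_spacing, min(step, max_step))

print(adaptive_step(dist_to_obstacle=500.0, min_spacing=10.0, max_step=200.0))  # 200.0 (open area)
print(adaptive_step(dist_to_obstacle=18.0, min_spacing=10.0, max_step=200.0))   # 10.0 (near obstacle)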
11:05 CEST TS20.2 TIMING-DRIVEN GLOBAL PLACEMENT WITH HYBRID HEURISTICS AND NADAM-BASED NET WEIGHTING
Speaker:
Linhao Lu, Southwest University of Science and Technology, CN
Authors:
Linhao Lu, Wenxin Yu, Hongwei Tian, Chenjin Li, Xinmiao Li, Zhaoqi Fu and Zhengjie Zhao, Southwest University of Science and Technology, CN
Abstract
Timing optimization is critical to the entire design flow of very-large-scale integration (VLSI) circuits, and global placement is pivotal in achieving timing closure. However, most global placement algorithms focus on optimizing wirelength rather than timing. To address this gap, we propose a timing-driven global placement algorithm that utilizes a Nadam-based net-weighting strategy, together with a hybrid heuristic approach for adaptive dynamic adjustment of net weights. Experimental results on the ICCAD 2015 contest benchmarks show that, compared to RePlAce, our algorithm significantly improves WNS and TNS by 40.7% and 56.5%, respectively.
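For readers unfamiliar with Nadam-style net weighting, the sketch below applies a generic Nadam update to per-net weights driven by negative slack. The gradient definition, hyperparameters, and the paper's hybrid-heuristic adjustment are not reproduced; everything here is an illustrative assumption.

import numpy as np

# Generic Nadam step applied to per-net weights: nets with worse (more negative)
# slack receive heavier weights. Gradient definition and hyperparameters are
# assumptions for illustration only.

def nadam_net_weights(weights, slack, step, m, v, lr=0.05,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    grad = -np.minimum(slack, 0.0)                 # only violating nets push weights up
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Nesterov correction: look one step ahead along the current gradient.
    update = lr * (beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** step)) / (np.sqrt(v_hat) + eps)
    return weights + update, m, v

weights, m, v = np.ones(4), np.zeros(4), np.zeros(4)
slack = np.array([0.2, -0.1, -0.5, 0.0])           # ns; negative = timing violation
weights, m, v = nadam_net_weights(weights, slack, step=1, m=m, v=v)
print(weights)                                      # violating nets end up with larger weights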
11:10 CEST TS20.3 IR-FUSION: A FUSION FRAMEWORK FOR STATIC IR DROP ANALYSIS COMBINING NUMERICAL SOLUTION AND MACHINE LEARNING
Speaker:
Feng Guo, Beijing University of Posts and Telecommunications, CN
Authors:
Feng Guo1, Jianwang Zhai1, Jingyu Jia1, Jiawei Liu1, Kang Zhao1, Bei Yu2 and Chuan Shi1
1Beijing University of Posts and Telecommunications, CN; 2The Chinese University of Hong Kong, HK
Abstract
IR drop analysis for on-chip power grids (PGs) is vital but computationally challenging due to the rapid growth in integrated circuit (IC) scale. Traditional numerical methods employed by current EDA software are accurate but extremely time-consuming. To achieve rapid analysis of IR drop, various machine learning (ML) methods have been introduced to address the inefficiency of numerical methods. However, issues of interpretability and scalability have limited their practical application. In this work, we propose IR-Fusion, which combines numerical methods with ML to achieve a trade-off and complementarity between accuracy and efficiency in static IR drop analysis. Specifically, the numerical method is used to obtain rough solutions, and ML models are utilized to further improve accuracy. In our framework, an efficient numerical solver, AMG-PCG, is applied to obtain rough numerical solutions. Then, based on the numerical solution, a fusion of hierarchical numerical-structural information representing the multilayer structure of the PG is employed, and an Inception Attention U-Net model is designed to capture details and the interaction of features at different scales. To cope with the limitations and diversity of PG designs, an augmented curriculum learning strategy is applied in the training phase. Evaluation of IR-Fusion shows that its accuracy is significantly better than previous ML-based methods, while requiring considerably fewer solver iterations than numerical methods to achieve the same accuracy.
11:15 CEST TS20.4 TIMING-DRIVEN DETAILED PLACEMENT WITH UNSUPERVISED GRAPH LEARNING
Speaker:
Dhoui Lim, Ulsan National Institute of Science and Technology, KR
Authors:
Dhoui Lim1 and Heechun Park2
1Kookmin University, School of Electrical Engineering, KR; 2Ulsan National Institute of Science and Technology (UNIST), KR
Abstract
Detailed placement is a crucial stage in VLSI design that starts from the global placement result to determine the final legal locations of each cell through fine-grained optimization. Traditional detailed placement methods focus on minimizing the half-perimeter wire length (HPWL), as in global placement. However, incorporating timing-driven placement becomes essential with the increasing complexity of VLSI designs and tighter performance constraints. In this paper, we propose a timing-driven detailed placement framework that leverages unsupervised graph learning techniques. Specifically, we integrate timing-related metrics into the objective function for detailed placement and formulate it as the loss function of a graph neural network (GNN) model. The loss function includes overlap, legality, and timing-related arc lengths, with weights determined via Bayesian optimization. Experimental results show that our framework achieves comparable or improved HPWL while significantly reducing total negative slack (TNS) by 5.5% compared to existing methods.
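To make the composite objective concrete, the sketch below shows one plausible shape of such a loss: a criticality-weighted timing-arc length plus overlap and row-legality penalties. The per-term forms and the weights (which the paper tunes with Bayesian optimization) are assumptions for illustration; the actual loss driving the GNN may differ.

import numpy as np

# One plausible shape of a timing-aware detailed-placement loss. All term
# definitions and weights here are illustrative assumptions.

def placement_loss(xy, arcs, arc_crit, cell_w, cell_h, row_pitch,
                   w_timing=1.0, w_overlap=10.0, w_legal=1.0):
    x, y = xy[:, 0], xy[:, 1]
    i, j = arcs[:, 0], arcs[:, 1]
    # Timing term: criticality-weighted Manhattan length of timing arcs i -> j.
    timing = np.sum(arc_crit * (np.abs(x[i] - x[j]) + np.abs(y[i] - y[j])))
    # Overlap term: total pairwise rectangle overlap area between cells.
    dx = np.maximum(0.0, cell_w - np.abs(x[:, None] - x[None, :]))
    dy = np.maximum(0.0, cell_h - np.abs(y[:, None] - y[None, :]))
    overlap = (np.sum(dx * dy) - xy.shape[0] * cell_w * cell_h) / 2.0
    # Legality term: distance of each cell to its nearest placement row.
    legal = np.sum(np.abs(y - np.round(y / row_pitch) * row_pitch))
    return w_timing * timing + w_overlap * overlap + w_legal * legal

xy = np.array([[0.0, 0.0], [1.5, 0.2], [4.0, 2.1]])
arcs = np.array([[0, 1], [1, 2]])
print(placement_loss(xy, arcs, arc_crit=np.array([1.0, 0.3]),
                     cell_w=1.0, cell_h=2.0, row_pitch=2.0))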
11:20 CEST TS20.5 EFFICIENT AND EFFECTIVE MACRO PLACEMENT FOR VERY LARGE SCALE DESIGNS USING RL AND MCTS INTEGRATION
Speaker:
Zong-Ze Lee, National Cheng Kung University, TW
Authors:
Jai-Ming Lin1, Zong-Ze Lee1 and Nan-Chu Lin2
1Department of Electrical Engineering, National Cheng Kung University, TW; 2National Cheng Kung University, TW
Abstract
Macro placement plays a critical role in modern designs. With the rise of artificial intelligence, some researchers have turned to reinforcement learning (RL) techniques to handle this problem. However, these approaches usually require substantial computing resources and runtime for training, making them impractical for very large-scale integration (VLSI) designs. To address these challenges, this paper proposes an effective placer based on the Monte Carlo Tree Search (MCTS) algorithm, guided by a pre-trained RL agent. To reduce the complexities of RL and MCTS, we transform the macro placement problem into a macro group allocation problem. Additionally, we propose a new reward function to facilitate training convergence in RL. Moreover, to reduce runtime without affecting placement quality, we use the pre-training result to directly evaluate the placement quality in MCTS for non-terminal nodes, significantly reducing the number of placement runs required. Experiments show that our MCTS-based placer can achieve high-quality results even in the early stages of RL training. Moreover, our method outperforms state-of-the-art placers.
11:25 CEST TS20.6 DAMIL-DCIM: A DIGITAL CIM LAYOUT SYNTHESIS FRAMEWORK WITH DATAFLOW-AWARE FLOORPLAN AND MILP-BASED DETAILED PLACEMENT
Speaker:
Chuyu Wang, Fudan University, CN
Authors:
Chuyu Wang, Ke Hu, Fan Yang, Keren Zhu and Xuan Zeng, Fudan University, CN
Abstract
Digital computing-in-memory (DCIM) systems integrate complex digital logic with parasitic-sensitive bitcell arrays. Conventional physical design strategies degrade DCIM performance due to a lack of dataflow regularity and excessive wirelength. As a result, current DCIM design often relies on manual layout, which is time-consuming and a bottleneck in the design cycle. Existing layout synthesis frameworks for DCIM often mimic the manual approach and employ a template-based method for DCIM placement. However, overly constrained templates lead to excessive core area, resulting in high costs in practice. In this work, we introduce DAMIL-DCIM, a novel placement framework that bridges template-based techniques with optimization-based placement methods. DAMIL-DCIM utilizes a global dataflow-aware floorplan inspired by template methods and further optimizes the layout using MILP (Mixed-Integer Linear Programming)-based detailed placement. The combination of global floorplanning and placement optimization reduces total wirelength while maintaining dataflow regularity, resulting in lower parasitics and enhanced performance. Experimental results show that, on a practical 28 nm DCIM circuit, our approach improves frequency by 25.2% and reduces power consumption by 19.6% compared to Cadence Innovus, while maintaining the same core area.
11:30 CEST TS20.7 BI-LEVEL OPTIMIZATION ACCELERATED DRC-AWARE PHYSICAL DESIGN AUTOMATION FOR PHOTONIC DEVICES
Speaker:
Hao Chen, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Hao Chen1, Yuzhe Ma1 and Yeyu Tong2
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Photonic integrated circuit (PIC) design has been challenged by the complex physics behind various integrated photonic devices. Inverse design offers an effective design automation solution for obtaining high-performance and compact photonic devices using computational algorithms and electromagnetic (EM) simulations. However, the challenge lies in transforming the fabrication-infeasible device geometries obtained from computational algorithms into reliable yet optimal physical designs. Incorporating fabrication constraints into the optimization iterations can extend running time and lead to performance compromises. In this work, we propose a novel DRC-aware photonic inverse design framework, leveraging bi-level optimization to enable end-to-end gradient-based device optimization. Our method guarantees that all intermediate devices on the optimization trajectory adhere to fabrication requirements and rules. The proposed workflow eliminates the need for a binarization process and fabrication constraint adaptation, thus enabling a fast and efficient search for high-performance and reliable integrated photonic devices. Experimental results demonstrate the benefits of our proposed method, including improved device performance and reduced EM simulations and running time.
11:35 CEST TS20.8 GTN-CELL: EFFICIENT STANDARD CELL CHARACTERIZATION USING GRAPH TRANSFORMER NETWORK
Speaker:
Lihao Liu, State Key Lab of Integrated Chips and Systems, School of Microelectronics, Fudan University, CN
Authors:
Lihao Liu, Beisi Lu, Yunhui Li, Li Shang and Fan Yang, Fudan University, CN
Abstract
Lookup table (LUT)-based standard cell characterization libraries are crucial to accurate static timing analysis (STA). However, with the continuous scaling of technology nodes and the increasing complexity of circuit designs, the traditional non-linear delay model (NLDM) is progressively unable to meet the required accuracy for cell modeling. The current source model (CSM) offers a more precise characterization of cells at advanced nodes and is able to handle arbitrary electrical waveforms. However, the CSM is highly time-consuming because it requires extensive transistor-level simulations, posing severe challenges to efficient standard cell library design. This work presents GTN-Cell, an efficient graph transformer network (GTN)-based method for library-compatible, LUT-based CSM waveform prediction in standard cell characterization. GTN-Cell represents the transistor-level structures of standard cells as graphs, learning the local structural information of each cell. By incorporating the transformer encoder into the model and embedding path-related positional encodings, GTN-Cell captures the global relationships between distant nodes within each cell. Compared with HSPICE, GTN-Cell achieves an average error of 2.27% on predicted voltage waveforms across different standard cells and timing arcs while reducing the number of simulations by 70%.
11:40 CEST TS20.9 WIRE-BONDING FINGER PLACEMENT FOR FBGA SUBSTRATE LAYOUT DESIGN WITH FINGER ORIENTATION CONSIDERATION
Speaker:
Yu-En Lin, National Taiwan University of Science and Technology, Department of Computer Science and Information Engineering, TW
Authors:
Yu-En Lin and Yi-Yu Liu, National Taiwan University of Science and Technology, TW
Abstract
Wire bonding is a mature packaging technique that enables chip pins to transmit signals to bonding fingers on the substrate through bonding wires. This commodity technology is also essential in supporting the rapid development of system-in-package and heterogeneous integration technologies. However, automation tools are relatively scarce compared to those for other packaging techniques, resulting in tremendous manual design time and engineering effort due to numerous wire-bonding design constraints. This paper addresses the finger placement problem and serves as the first work to consider the orientation constraint of fingers. The finger placement flow is divided into three stages. First, an integer linear programming (ILP) formulation is developed to allocate a finger row to each net. After that, we utilize mixed-integer quadratic programming (MIQP) to place the bonding fingers while considering the wire crossing constraint. Finally, the locations of the bonding fingers are refined by considering both the bonding finger orientation angle and the finger spacing constraints. The final layouts generated by our integrated finger placement and substrate routing framework outperform manual designs in terms of design time, total wirelength, and routing completion rate.
11:45 CEST TS20.10 A PARALLEL FLOATING RANDOM WALK SOLVER FOR REPRODUCIBLE AND RELIABLE CAPACITANCE EXTRACTION
Speaker:
Jiechen Huang, Dept. Computer Science & Tech., Tsinghua University, CN
Authors:
Jiechen Huang1, Shuailong Liu2 and Wenjian Yu1
1Tsinghua University, CN; 2Exceeda Inc., CN
Abstract
The floating random walk (FRW) method is a popular and promising tool for capacitance extraction, but its stochastic nature leads to critical limitations in reproducibility and physics-related reliability. In this work, we present FRW-RR, a parallel FRW solver with enhancements for Reproducible and Reliable capacitance extraction. First, we propose a novel parallel FRW scheme that ensures reproducible results, regardless of the degree of parallelism (DOP) or machine used. We further optimize its parallel efficiency and enhance its numerical stability. Then, to guarantee the physical properties of capacitances and reliability for downstream tasks, we propose a regularization technique based on constrained multi-parameter estimation to postprocess the FRW results. Experiments on actual IC structures demonstrate that FRW-RR ensures DOP-independent reproducibility (to at least 12 significant decimal digits) and physics-related reliability with negligible overhead. It has remarkable advantages over existing FRW solvers, including the one in [1].
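The claim of DOP-independent reproducibility rests on decoupling each walk's randomness from how walks are scheduled. The Python sketch below shows the generic pattern (seed each walk from its global index only); it is not FRW-RR's actual scheme, and the "walk" here is a trivial stand-in for a real floating random walk.

import random
from concurrent.futures import ThreadPoolExecutor

# Generic pattern for DOP-independent Monte Carlo: each walk's randomness
# depends only on its global index, so the estimate is identical no matter how
# walks are split across workers. Not FRW-RR's actual algorithm.

def one_walk(walk_id: int) -> float:
    rng = random.Random(12345 ^ walk_id)   # seed derived from the walk index only
    return rng.random()                    # toy stand-in for a floating random walk

def estimate(n_walks: int, workers: int) -> float:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        samples = list(pool.map(one_walk, range(n_walks)))
    return sum(samples) / n_walks          # same summation order for any worker count

print(estimate(10_000, workers=1) == estimate(10_000, workers=8))   # True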
11:50 CEST TS20.11 A COMPREHENSIVE INDUCTANCE-AWARE MODELING APPROACH TO POWER DISTRIBUTION NETWORK IN HETEROGENEOUS 3D INTEGRATED CIRCUITS
Speaker:
Yuanqing Cheng, Beihang University, CN
Authors:
Quansen Wang1, Vasilis Pavlidis2 and Yuanqing Cheng1
1Beihang University, CN; 2Aristotle University of Thessaloniki, GR
Abstract
Heterogeneous 3D integration technology is a cost-effective and high-performance alternative to planar integrated circuits (ICs). In this paper, we propose an on-chip power distribution network (PDN) modeling technique for heterogeneous 3D-ICs (H3D-ICs), which explicitly takes the effects of on-chip inductance into account. The proposed model facilitates efficient transient and AC simulations with integrated inductive effects, enabling accurate noise characterization at high frequencies and facilitating the exploration of early-stage PDN design. The model is validated via HSPICE simulations, demonstrating a maximum error below 1% and achieving average speedups of 1.5x in transient and 8.5x in AC simulations.

TS21 Design methodologies for machine learning architectures

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS21.1 SPNERF: MEMORY EFFICIENT SPARSE VOLUMETRIC NEURAL RENDERING ACCELERATOR FOR EDGE DEVICES
Speaker:
Yipu Zhang, The Hong Kong University of Science and Technology, HK
Authors:
Yipu Zhang1, Jiawei Liang1, Jian Peng1, Jiang Xu2 and Wei Zhang1
1The Hong Kong University of Science and Technology, HK; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Neural rendering has gained prominence for its high-quality output, which is crucial for AR/VR applications. However, its large voxel grid data size and irregular access patterns challenge real-time processing on edge devices. While previous works have focused on improving data locality, they have not adequately addressed the issue of large voxel grid sizes, which necessitate frequent off-chip memory access and substantial on-chip memory. This paper introduces SpNeRF, a software-hardware co-design solution tailored for sparse volumetric neural rendering. We first identify memory-bound rendering inefficiencies and analyze the inherent sparsity in the voxel grid data of neural rendering. To enhance efficiency, we propose novel preprocessing and online decoding steps, reducing memory size for the voxel grid. The preprocessing step employs hash mapping to support irregular data access while maintaining a minimal memory size. The online decoding step enables efficient on-chip sparse voxel grid processing, incorporating bitmap masking to mitigate PSNR loss caused by hash collisions. To further optimize performance, we design a dedicated hardware architecture supporting our sparse voxel grid processing technique. Experimental results demonstrate that SpNeRF achieves an average 21.07× reduction in memory size while maintaining comparable PSNR levels. When benchmarked against Jetson XNX, Jetson ONX, RT-NeRF.Edge and NeuRex.Edge, our design achieves speedups of 95.1×, 63.5×, 1.5× and 10.3×, and improves energy efficiency by 625.6×, 529.1×, 4×, and 4.4×, respectively.
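The sketch below illustrates, in simplified form, how a hash-mapped sparse voxel table combined with an occupancy bitmap can serve lookups while masking hash collisions onto empty voxels. The hash function, table size, and masking policy are assumptions for illustration and are not SpNeRF's exact preprocessing and decoding scheme.

import numpy as np

# Hashed sparse-voxel lookup with bitmap masking (illustrative assumptions only).

GRID, TABLE_SIZE = 64, 4096

def voxel_hash(ix, iy, iz):
    return ((ix * 1) ^ (iy * 2654435761) ^ (iz * 805459861)) % TABLE_SIZE

occupancy = np.zeros((GRID, GRID, GRID), dtype=bool)     # bitmap of occupied voxels
table = np.zeros((TABLE_SIZE, 4), dtype=np.float32)      # 4-channel voxel features
for (x, y, z) in [(3, 7, 9), (10, 10, 10)]:              # toy occupied voxels
    occupancy[x, y, z] = True
    table[voxel_hash(x, y, z)] = np.random.rand(4)

def lookup(ix, iy, iz):
    """Return the voxel feature; the bitmap masks out empty voxels even if their hash collides."""
    if not occupancy[ix, iy, iz]:
        return np.zeros(4, dtype=np.float32)
    return table[voxel_hash(ix, iy, iz)]

print(lookup(3, 7, 9))    # stored feature
print(lookup(3, 7, 8))    # zeros: empty voxel, regardless of hash collisions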
11:05 CEST TS21.2 SBQ: EXPLOITING SIGNIFICANT BITS FOR EFFICIENT AND ACCURATE POST-TRAINING DNN QUANTIZATION
Speaker:
Jiayao Ling, Shanghai Jiao Tong University, CN
Authors:
Jiayao Ling1, Gang Li2, Qinghao Hu2, Xiaolong Lin1, Cheng Gu1, Jian Cheng3 and Xiaoyao Liang1
1Shanghai Jiao Tong University, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN; 3Institute of Automation, CN
Abstract
Post-Training Quantization (PTQ) is an effective technique for deep neural network acceleration. However, as the bit-width decreases to 4 bits and below, PTQ faces significant challenges in preserving accuracy, especially for attention-based models like LLMs. The main issue lies in the considerable clipping and rounding errors induced by the limited number of quantization levels and narrow data range in conventional low-precision quantization. In this paper, we present an efficient and accurate PTQ method that targets 4 bits and below through algorithm and architecture co-design. Our key idea is to dynamically extract a small portion of significant bit terms from high-precision operands to perform low-precision multiplications under a given computational budget. Specifically, we propose Significant-Bit Quantization (SBQ). It exploits a product-aware method to dynamically identify significant terms and an error-compensated computation scheme to minimize compute errors. We present a dedicated inference engine to unleash the power of SBQ. Experiments on CNNs, ViTs, and LLMs reveal that SBQ consistently outperforms prior PTQ methods under 2~4-bit quantization. We also compare the proposed inference engine with state-of-the-art bit-operation-based quantization architectures TQ and Sibia. Results show that SBQ achieves the highest area and energy efficiency.
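As a minimal illustration of the significant-bit idea (not SBQ's product-aware selection or error compensation), the sketch below keeps only the k most-significant set bits of one operand so that the multiplication reduces to k shift-add terms.

# Keep the k most-significant set bits of an operand and multiply with them.
# Simplified stand-in for the significant-bit idea; SBQ's actual term selection
# and error compensation are more involved.

def top_k_bit_terms(value: int, k: int) -> int:
    """Approximate |value| by its k most-significant power-of-two terms."""
    mag, terms, kept = abs(value), 0, 0
    for bit in range(mag.bit_length() - 1, -1, -1):
        if mag & (1 << bit):
            terms |= 1 << bit
            kept += 1
            if kept == k:
                break
    return -terms if value < 0 else terms

def approx_mul(a: int, b: int, k: int = 2) -> int:
    return top_k_bit_terms(a, k) * b         # only k shift-add terms are needed

a, b = 181, 77                                # 181 = 0b10110101
print(a * b, approx_mul(a, b, k=2))           # exact result vs. 2-term approximation (160 * 77)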
11:10 CEST TS21.3 AIRCHITECT V2: LEARNING THE HARDWARE ACCELERATOR DESIGN SPACE THROUGH UNIFIED REPRESENTATIONS
Speaker:
Akshat Ramachandran, Georgia Tech, US
Authors:
Akshat Ramachandran1, Jamin Seo1, Yu-Chuan Chuang2, Anirudh Itagi1 and Tushar Krishna1
1Georgia Tech, US; 2National Taiwan University, TW
Abstract
Design space exploration (DSE) plays a crucial role in enabling custom hardware architectures, particularly for emerging applications like AI, where optimized and specialized designs are essential. With the growing complexity of deep neural networks (DNNs) and the introduction of advanced foundational models (FMs), the design space for DNN accelerators is expanding at an exponential rate. Additionally, this space is highly non-uniform and non-convex, making it increasingly difficult to navigate and optimize. Traditional DSE techniques rely on search-based methods, which involve iterative sampling of the design space to find the optimal solution. However, this process is both time-consuming and often fails to converge to the global optima for such design spaces. Recently, AIrchitect v1, the first attempt to address the limitations of search-based techniques, transformed DSE into a constant-time classification problem using recommendation networks. In this work, we propose AIrchitect v2, a more accurate and generalizable learning-based DSE technique applicable to large-scale design spaces that overcomes the shortcomings of earlier approaches. Specifically, we devise an encoder-decoder transformer model that (a) encodes the complex design space into a uniform intermediate representation using contrastive learning and (b) leverages a novel unified representation blending the advantages of classification and regression to effectively explore the large DSE space without sacrificing accuracy. Experimental results evaluated on 10^5 real DNN workloads demonstrate that, on average, AIrchitect v2 outperforms existing techniques by 15% in identifying optimal design points. Furthermore, to demonstrate the generalizability of our method, we evaluate performance on unseen model workloads (LLMs) and attain a 1.7x improvement in inference latency on the identified hardware architecture. Code and dataset are available at: https://github.com/maestro-project/AIrchitect-v2.
11:15 CEST TS21.4 ZEBRA: LEVERAGING DIAGONAL ATTENTION PATTERN FOR VISION TRANSFORMER ACCELERATOR
Speaker:
Sukhyun Han, Sungkyunkwan University, KR
Authors:
Sukhyun Han, Seongwook Kim, Gwangeun Byeon, Jihun Yoon and Seokin Hong, Sungkyunkwan University, KR
Abstract
Vision Transformers (ViTs) have achieved remarkable performance in computer vision, but their computational complexity and challenges in optimizing memory bandwidth limit hardware acceleration. A major bottleneck lies in the self-attention mechanism, which leads to excessive data movement and unnecessary computations despite high input sparsity and low computational demands. To address this challenge, existing transformer accelerators have leveraged sparsity in attention maps. However, their performance gains are limited due to low hardware utilization caused by the irregular distribution of non-zero values in the sparse attention maps. Self-attention often exhibits strong diagonal patterns in the attention map, as the diagonal elements tend to have higher values than others. To exploit this, we introduce Zebra, a hardware accelerator framework optimized for diagonal attention patterns. A core component of Zebra is the Striped Diagonal (SD) pruning technique, which prunes the attention map by preserving only the diagonal elements at runtime. This reduces computational load without requiring offline pre-computation or causing significant accuracy loss. Zebra features a reconfigurable accelerator architecture that supports optimized matrix multiplication method, called Striped Diagonal Matrix Multiplication (SDMM), which computes only the diagonal elements of matrices. With this novel method, Zebra addresses low hardware utilization, a key barrier to leveraging the diagonal patterns. Experimental results demonstrate that Zebra achieves a 57x speedup over a CPU and 1.7x over the state-of-the-art ViT accelerator with similar inference accuracy.
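The sketch below shows the arithmetic effect of striped diagonal pruning: only attention scores within a fixed band of the diagonal are computed, and the rest are masked out. The band width and the dense reference formulation are assumptions for illustration; Zebra's SDMM maps this stripe onto a reconfigurable array rather than masking a dense product.

import numpy as np

# Attention restricted to a diagonal stripe of width 2*band+1 (illustrative only).

def striped_diagonal_attention(Q, K, V, band=2):
    n, d = Q.shape
    scores = np.full((n, n), -np.inf)
    for i in range(n):
        lo, hi = max(0, i - band), min(n, i + band + 1)
        scores[i, lo:hi] = Q[i] @ K[lo:hi].T / np.sqrt(d)   # compute only the stripe
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(striped_diagonal_attention(Q, K, V, band=2).shape)    # (8, 16)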
11:20 CEST TS21.5 PUSHING UP TO THE LIMIT OF MEMORY BANDWIDTH AND CAPACITY UTILIZATION FOR EFFICIENT LLM DECODING ON EMBEDDED FPGA
Speaker:
Jindong Li, Institute of Computing Technology, Chinese Academy of Sciences, CN
Authors:
Jindong Li, Tenglong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang and Yi Zeng, Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
The extremely high computational and storage demands of large language models have excluded most edge devices, which were widely used for efficient machine learning, from being viable options. A typical edge device usually only has 4GB of memory capacity and a bandwidth of less than 20GB/s, while a large language model quantized to 4-bit precision with 7B parameters already requires 3.5GB of capacity, and its decoding process is purely bandwidth-bound. In this paper, we aim to explore these limits by proposing a hardware accelerator for large language model (LLM) inference on the Zynq-based KV260 platform, equipped with 4GB of 64-bit 2400Mbps DDR4 memory. We successfully deploy a LLaMA2-7B model, achieving a decoding speed of around 5 token/s, utilizing 93.3% of the memory capacity and reaching 85% decoding speed of the theoretical memory bandwidth limit. To fully reserve the memory capacity for model weights and key-value cache, we develop the system in a bare-metal environment without an operating system. To fully reserve the bandwidth for model weight transfers, we implement a customized dataflow with an operator fusion pipeline and propose a data arrangement format that can maximize the data transaction efficiency. This research marks the first attempt to deploy a 7B level LLM on a standalone embedded field programmable gate array (FPGA) device. It provides key insights into efficient LLM inference on embedded FPGA devices and provides guidelines for future architecture design.
11:25 CEST TS21.6 LEVERAGING COMPUTE-IN-MEMORY FOR EFFICIENT GENERATIVE MODEL INFERENCE IN TPUS
Speaker:
Zhantong Zhu, Peking University, CN
Authors:
Zhantong Zhu, Hongou Li, Wenjie Ren, Meng Wu, Le Ye, Ru Huang and Tianyu Jia, Peking University, CN
Abstract
With the rapid advent of generative models, efficiently deploying these models on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption necessitates innovations for improving efficiency. Compute-in-memory (CIM) has emerged as a promising paradigm with superior area and energy efficiency. In this work, we present a TPU architecture that integrates digital CIM to replace conventional digital systolic arrays in matrix multiply units (MXUs). We first establish a CIM-based TPU architecture model and simulator to evaluate the benefits of CIM for diverse generative model inference. Building upon the observed design insights, we further explore various CIM-based TPU architectural design choices. Up to 44.2% and 33.8% performance improvement for large language model and diffusion transformer inference, and 27.3x reduction in MXU energy consumption can be achieved with different design choices, compared to the baseline TPUv4i architecture.
11:30 CEST TS21.7 SPARSEINFER: TRAINING-FREE PREDICTION OF ACTIVATION SPARSITY FOR FAST LLM INFERENCE
Speaker:
Jiho Shin, University of Seoul, KR
Authors:
Jiho Shin1, Hoeseok Yang2 and Youngmin Yi3
1University of Seoul, KR; 2Santa Clara University, US; 3Sogang University, KR
Abstract
Leveraging sparsity is crucial for optimizing large language model (LLM) inference; however, modern LLMs employing SiLU as their activation function exhibit minimal activation sparsity. Recent research has proposed replacing SiLU with ReLU to induce significant activation sparsity and showed, through fine-tuning, no downstream task accuracy degradation. However, taking full advantage of this sparsity required training a predictor to estimate it. In this paper, we introduce SparseInfer, a simple, lightweight, and training-free predictor for the activation sparsity of ReLU-fied LLMs, in which activation sparsity is predicted by comparing only the sign bits of inputs and weights. To compensate for possible prediction inaccuracy, the predictor's conservativeness can be adaptively tuned, which also serves as a control knob for optimizing LLM inference. The proposed method achieves approximately 21% faster inference than the state-of-the-art, with a negligible accuracy loss of within one percentage point.
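The prediction rule stated in the abstract (compare only the sign bits of inputs and weights) can be illustrated as below: an output neuron whose input/weight signs mostly disagree is likely negative before ReLU and therefore zero after it. The 0.5 threshold stands in for the adaptive conservativeness knob and is an assumption, not the paper's tuned value.

import numpy as np

# Sign-bit-based prediction of post-ReLU zeros (illustrative threshold).

def predict_zero_activations(x, W, threshold=0.5):
    """Boolean mask of output neurons predicted to be zero after ReLU."""
    sign_agree = (x[None, :] >= 0) == (W >= 0)     # per (output, input) sign match
    agree_frac = sign_agree.mean(axis=1)
    return agree_frac < threshold                   # few agreements -> likely negative pre-activation

rng = np.random.default_rng(1)
x = rng.standard_normal(256)
W = rng.standard_normal((64, 256))                  # 64 output neurons
pred_zero = predict_zero_activations(x, W)
true_zero = np.maximum(W @ x, 0.0) == 0.0
print(f"predicted sparsity: {pred_zero.mean():.2f}, "
      f"match with true zeros: {(pred_zero == true_zero).mean():.2f}")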
11:35 CEST TS21.8 LOW-RANK COMPRESSION FOR IMC ARRAYS
Speaker:
Kang Eun Jeon, Sungkyunkwan University, KR
Authors:
Kang Eun Jeon, Johnny Rhe and Jong Hwan Ko, Sungkyunkwan University, KR
Abstract
In this study, we address the challenge of low-rank model compression in the context of in-memory computing (IMC) architectures. Traditional pruning approaches, while effective in model size reduction, necessitate additional peripheral circuitry to manage complex dataflows and mitigate dislocation issues, leading to increased area and energy overheads, especially when model sparsity does not meet a specific threshold. To circumvent these drawbacks, we propose leveraging low-rank compression techniques, which, unlike pruning, streamline the dataflow and seamlessly integrate with IMC architectures. However, low-rank compression presents its own set of challenges, notably suboptimal IMC array utilization and compromised accuracy compared to traditional pruning methods. To address these issues, we introduce a novel approach employing a shift and duplicate kernel (SDK) mapping technique, which exploits idle IMC columns for parallel processing, and group low-rank convolution, which mitigates the information imbalance in the decomposed matrices. Our experimental results, using ResNet-20 and Wide ResNet16-4 networks on CIFAR-10 and CIFAR-100 datasets, demonstrate that our proposed method not only matches the performance of existing pruning techniques on ResNet-20 but also achieves up to 2.5x speedup and +20.9% accuracy boost on Wide ResNet16-4.
11:40 CEST TS21.9 INTEGER UNIT-BASED OUTLIER-AWARE LLM ACCELERATOR PRESERVING NUMERICAL ACCURACY OF FP-FP GEMM
Speaker:
Jehun Lee, Seoul National University, KR
Authors:
Jehun Lee and Jae-Joon Kim, Seoul National University, KR
Abstract
The proliferation of large language models (LLMs) has significantly heightened the importance of quantization to alleviate the computational burden given the surge in the number of parameters. However, quantization often targets a subset of a LLM and relies on the floating-point (FP) arithmetic for matrix multiplication of specific subsets, leading to performance and energy overhead. Additionally, to compensate for the quality degradation incurred by quantization, retraining methods are frequently employed, demanding significant efforts and resources. This paper proposes OwL-P, an outlier-aware LLM inference accelerator which preserves the numerical accuracy of FP arithmetic while enhancing hardware efficiency with an integer (INT)-based arithmetic unit for general matrix multiplication (GEMM), through the use of a shared exponent and efficient management of outlier data. It also mitigates off-chip bandwidth requirements by employing a compressed number format. The proposed number format leverages outliers and shared exponents to facilitate the compression of both model weights and activations. We evaluate this work across 10 different transformer-based benchmarks, and the results demonstrate that the proposed integer-based LLM accelerator achieves an average 2.70× performance gain and 3.57× energy savings while maintaining the numerical accuracy of the FP arithmetic.
11:45 CEST TS21.10 LEVERAGING HOT DATA IN A MULTI-TENANT ACCELERATOR FOR EFFECTIVE SHARED MEMORY MANAGEMENT
Speaker:
Chunmyung Park, Seoul National University, KR
Authors:
Chunmyung Park, Jicheon Kim, Eunjae Hyun, Xuan Truong Nguyen and Hyuk-Jae Lee, Seoul National University, KR
Abstract
Multi-tenant neural networks (MTNN) have been emerging in various domains. To effectively handle multi-tenant workloads, modern hardware systems typically incorporate multiple compute cores with shared memory systems. While prior works have intensively studied compute- and bandwidth-aware allocation, on-chip memory allocation for MTNN accelerators has not been well studied. This work identifies two key challenges of on-chip memory allocation in MTNN accelerators: on-chip memory shortages, which force data eviction to off-chip memory, and on-chip memory underutilization, where memory remains idle due to coarse-grained allocation. Both issues lead to increased external memory accesses (EMAs), significantly degrading system performance. To address these challenges, we propose HotPot, a novel multi-tenant accelerator with a runtime temperature-aware memory allocator. HotPot prioritizes hot data for global on-chip memory allocation, reducing unnecessary EMAs and optimizing memory utilization. Specifically, HotPot introduces a temperature score that quantifies reuse potential and guides runtime memory allocation decisions. Experimental results demonstrate that HotPot improves system throughput (STP) by up to 1.88× and average normalized turnaround time (ANTT) by 1.52× compared to baseline methods.
11:50 CEST TS21.11 DOTS: DRAM-PIM OPTIMIZATION FOR TALL AND SKINNY GEMM OPERATIONS IN LLM INFERENCE
Speaker:
Gyeonghwan Park, Seoul National University, KR
Authors:
Gyeonghwan Park, Sanghyeok Han, Yoon Byungkuk and Jae-Joon Kim, Seoul National University, KR
Abstract
For large language models (LLMs), increasing token lengths require smaller batch sizes due to the growing memory requirement of KV caching, leading to under-utilization of processing units and a memory-bandwidth bottleneck in NPUs. To address this challenge, we propose DOTS, a new DRAM-PIM architecture that can handle both GEMV and GEMM efficiently, even outperforming NPUs in GEMM operations when batch sizes are small. The proposed DRAM-PIM reduces the power consumption and latency caused by frequent DRAM row activation switching in conventional DRAM-PIMs with negligible hardware overhead. Simulation results show that our proposed design achieves throughput improvements of 1.83x, 1.92x, and 1.7x over GPU, NPU, and heterogeneous NPU/PIM systems, respectively, for models as large as or larger than OPT-175B.
11:51 CEST TS21.12 LLM4GV: AN LLM-BASED FLEXIBLE PERFORMANCE-AWARE FRAMEWORK FOR GEMM VERILOG GENERATION
Speaker:
Meiqi Wang, Sun Yat-sen University, CN
Authors:
Dingyang Zou1, Gaoche Zhang1, Kairui Sun2, Wen Zhe3, Meiqi Wang2 and Zhongfeng Wang1
1Nanjing University, CN; 2Sun Yat-sen University, CN; 3Sun Yat-sen University, CN
Abstract
Advancements in AI have increased the demand for specialized AI accelerators, with the design of general matrix multiplication (GEMM) modules being crucial but time-consuming. While large language models (LLMs) show promise for automating GEMM design, challenges arise from GEMM's vast design space and performance requirements. Existing LLM-based frameworks for RTL code generation often lack flexibility and performance awareness. To overcome these challenges, we propose LLM4GV, a multi-agent LLM-based framework that integrates hardware optimization techniques (HOTs) and performance modeling, improving the correctness and performance of the generated code over prior works.

TS22 Design and test of hardware security primitives

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS22.1 USING OFF-SET ONLY FOR CORRUPTING CIRCUIT TO RESIST STRUCTURAL ATTACK IN CAC LOCKING
Speaker:
Hsaing-Chun Cheng, National Tsing Hua University, TW
Authors:
Hsaing-Chun Cheng, RuiJie Wang and TingTing Hwang, National Tsing Hua University, TW
Abstract
Corrupt-and-Correct (CAC) logic locking techniques [1]–[4] are state-of-the-art hardware security techniques designed to protect IC/IP designs from IP piracy, reverse engineering, overproduction, and unauthorized use. Although these techniques are resilient to SAT-based attacks, they remain vulnerable to structural attacks, which exploit structural traces left by the synthesis tool to recover the original design. In this paper, we propose a novel method that uses only the OFF-set to corrupt the circuit. This approach helps the added circuitry merge better with the original circuit, thereby thwarting structural attacks while maintaining resilience to SAT-based attacks. Additionally, we demonstrate that our proposed method incurs less area overhead than previous locking methods in HIID [5]. Compared to SFLL-rem [4], our method achieves comparable area overhead while effectively resisting structural attacks, including Valkyrie [6] and SPI attacks [7].
11:05 CEST TS22.2 RUNTIME SECURITY ANALYSIS OF MONOLITHIC 3D EMBEDDED DRAM WITH OXIDE-CHANNEL TRANSISTOR
Speaker:
Eduardo Ortega, Arizona State University, US
Authors:
Eduardo Ortega1, Jungyoun Kwak2, Shimeng Yu2 and Krishnendu Chakrabarty1
1Arizona State University, US; 2Georgia Tech, US
Abstract
We present the first security and disturbance study of monolithic 3D (M3D) embedded DRAM (eDRAM) with a 2T gain cell using oxide-channel transistors. We explore the Rowhammer/Rowpress vulnerabilities of amorphous indium tungsten oxide (IWO) transistors for eDRAM with standalone 2D integration and memory-on-memory M3D integration. In addition, we examine M3D-specific electrical disturbances from memory-on-logic M3D integration. We evaluate IWO eDRAM's susceptibility to these vulnerabilities and disturbances and discuss the potential impact on M3D integration. We examine physical design and architecture strategies for M3D integration of IWO eDRAM and provide systematic recommendations to inform security strategies for M3D integration and the security of IWO eDRAM. Our results show that limiting the minimum vertical interlayer distance to 300 nm reduces vertical disturbances in memory-on-memory M3D integration. In addition, for memory-on-logic M3D integration, we observe that IWO eDRAM's read bitline is sensitive to crosstalk from high-speed switching logic circuits. We also show that IWO eDRAM with standalone 2D integration is 30X more resilient to Rowhammer than current state-of-the-art memory because the IWO transistor's ON/OFF current ratio is roughly three orders of magnitude greater than that of standard memory access transistors.
11:10 CEST TS22.3 EXPLORING LARGE INTEGER MULTIPLICATION FOR CRYPTOGRAPHY TARGETING IN-MEMORY COMPUTING
Speaker:
Florian Krieger, TU Graz, AT
Authors:
Florian Krieger, Florian Hirner and Sujoy Sinha Roy, TU Graz, AT
Abstract
Emerging cryptographic systems such as Fully Homomorphic Encryption (FHE) and Zero-Knowledge Proofs (ZKP) are computation- and data-intensive. FHE and ZKP implementations in software and hardware largely rely on the von Neumann architecture, where a significant amount of energy is lost on data movement. A promising computing paradigm is computing in memory (CIM), which enables computations to occur directly within memory, thereby reducing data movement and energy consumption. However, efficiently performing large integer multiplications – critical in FHE and ZKP – is an open question, as existing CIM methods are limited to small operand sizes. In this work, we address this question by exploring advanced algorithmic approaches for large integer multiplication, identifying the Karatsuba algorithm as the most effective for CIM applications. Thereafter, we design the first Karatsuba multiplier for resistive CIM crossbars. Our multiplier uses a three-stage pipeline to enhance throughput and, additionally, balances memory endurance with efficient array sizes. Compared to existing CIM multiplication methods, when scaled up to the bit widths required in ZKP and FHE, our design achieves up to 916x higher throughput and a 281x better area-time product.
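The recursion the multiplier builds on is standard Karatsuba: one wide multiplication becomes three half-width multiplications plus shifts and additions, which is what makes FHE/ZKP-sized operands tractable on small CIM tiles. The sketch below shows only this recursion; the crossbar mapping, pipelining, and endurance balancing of the paper are not represented, and the 64-bit base case is an arbitrary assumption.

# Plain Karatsuba recursion (illustration only; no CIM mapping shown).

def karatsuba(x: int, y: int, base_bits: int = 64) -> int:
    if x < (1 << base_bits) or y < (1 << base_bits):
        return x * y                                  # small enough for a single tile
    half = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> half, x & ((1 << half) - 1)
    yh, yl = y >> half, y & ((1 << half) - 1)
    hi = karatsuba(xh, yh, base_bits)
    lo = karatsuba(xl, yl, base_bits)
    mid = karatsuba(xh + xl, yh + yl, base_bits) - hi - lo   # the single extra product
    return (hi << (2 * half)) + (mid << half) + lo

a = 0x123456789ABCDEF0123456789ABCDEF0
b = 0x0FEDCBA9876543210FEDCBA987654321
assert karatsuba(a, b) == a * b
print(hex(karatsuba(a, b)))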
11:15 CEST TS22.4 A LOW-COMPLEXITY TRUE RANDOM NUMBER GENERATION SCHEME USING 3D-NAND FLASH MEMORY
Speaker:
Ruibin Zhou, Sun Yat-sen University, CN
Authors:
Ruibin Zhou1, Jian Huang1, Xianping Liu2, Yuhan Wang1, Xinrui Zhang1, Yungen Peng1 and Zhiyi Yu1
1Sun Yat-Sen University, CN; 2Sun Yat-Sen University and Peng Cheng Laboratory, CN
Abstract
Unpredictable true random numbers are essential in cryptographic applications and secure communications. However, implementing True Random Number Generators (TRNGs) typically requires specialized hardware devices. In this paper, we propose a low-complexity true random number extraction scheme that can be implemented in endpoint systems containing 3D-NAND flash memory chips, addressing the need for random numbers without requiring additional complex hardware. We successfully utilized the randomness of the rapid charging and discharging of shallow charge traps in 3D-NAND memory as an entropy source. The proposed approach only requires conventional user-mode erase, program, and read operations, without any special timing control. We successfully extracted random bitstream using this scheme without a post-debiasing process. We evaluated the randomness of the generated bitstream using the NIST SP 800-22 statistical test suite, and it passed all 15 tests.
11:20 CEST TS22.5 A SYNTHESIZABLE THYRISTOR-LIKE LEAKAGE-BASED TRUE RANDOM NUMBER GENERATOR
Speaker:
Seohyun Kim, Ajou University, KR
Authors:
Seo Hyun Kim, Jang Hyun Kim and Jongmin Lee, Ajou University, KR
Abstract
As the demand for random data in cryptographic systems continues to rise, the importance of True Random Number Generators (TRNGs) becomes increasingly crucial for securing cryptographic applications. However, designing a TRNG that is reliable, secure, and cost-effective presents a significant challenge in hardware security. In this paper, we propose a synthesizable TRNG design based on a thyristor-like leakage-based (TL) structure, optimized for secure applications with small area and cost-efficiency. Our design has been validated using a 65-nm CMOS process, achieving a throughput of 0.397 Mbps within a compact area of 14.4 μm², offering considerable cost savings while maintaining high randomness and an area-throughput trade-off of 27.57 Gbps/mm². Moreover, this TRNG can be synthesized as a standard cell through a semi-custom design flow, significantly reducing design costs and enabling design automation, which streamlines the process and reduces the time and effort required compared to traditional full-custom TRNGs. Additionally, as it is library characterized, the number of TL TRNG cells can be freely adjusted to meet specific application requirements, offering flexibility in both performance and scalability. To assess its randomness, the NIST statistical test suite was applied, and the proposed TL TRNG successfully passed all applicable tests, demonstrating its randomness.
11:25 CEST TS22.6 GRAFTED TREES BEAR BETTER FRUIT: AN IMPROVED MULTIPLE-VALUED PLAINTEXT-CHECKING SIDE-CHANNEL ATTACK AGAINST KYBER
Speaker:
Jinnuo Li, School of Computer Science, China University of Geosciences, Wuhan, China, CN
Authors:
Jinnuo Li1, Chi Cheng1, Muyan Shen2, Peng Chen1, Qian Guo3, Dongsheng Liu4, Liji Wu5 and Jian Weng6
1China University of Geosciences, Wuhan, CN; 2School of Cryptology, University of Chinese Academy of Sciences, Beijing, China, CN; 3Lund University, Lund, Sweden, SE; 4School of Integrated Circuits, Huazhong University of Science and Technology, CN; 5School of Integrated Circuits, Tsinghua University, Beijing, China, CN; 6College of Cyber Security, Jinan University, Guangzhou, China, CN
Abstract
As a prominent category of side-channel attacks (SCAs), plaintext-checking (PC) oracle-based SCAs offer the advantages of generality and operational simplicity on a targeted device. At TCHES 2023, Rajendran et al. and Tanaka et al. independently proposed the multiple-valued (MV) PC oracle, significantly reducing the required number of queries (a.k.a., traces) in the PC oracle. However, in practice, when dealing with environmental noise or inaccuracies in the waveform classifier, they still rely on majority voting or the other technique that usually results in three times the number of queries compared to the ideal case. In this paper, we propose an improved method to further reduce the number of queries of the MV-PC oracle, particularly in scenarios where the oracle is imperfect. Compared to the state-of-the-art at TCHES 2023, our proposed method reduces the number of queries for a full key recovery by more than 42.5%. The method involves three rounds. Our key observation is that coefficients recovered in the first round can be regarded as prior information to significantly aid in retrieving coefficients in the second round. This improvement is achieved through a newly designed grafted tree. Notably, the proposed method is generic and can be applied to both the NIST key encapsulation mechanism (KEM) standard Kyber and other significant candidates, such as Saber and Frodo. We have conducted extensive software simulations against Kyber-512, Kyber-768, Kyber-1024, FireSaber, and Frodo-1344 to validate the efficiency of the proposed method. An electromagnetic attack conducted on real-world implementations, using an STM32F407G board equipped with an ARM Cortex-M4 microcontroller and Kyber implementation from the public library pqm4, aligns well with our simulations.
11:30 CEST TS22.7 CAS-PUF: CURRENT-MODE ARRAY-TYPE STRONG PUF FOR SECURE COMPUTING IN AREA CONSTRAINED SOCS
Speaker:
Dimosthenis Georgoulas, University of Ioannina, GR
Authors:
Dimosthenis Georgoulas, Yiorgos Tsiatouhas and Vasileios Tenentes, University of Ioannina, GR
Abstract
Secure computing necessitates the integration in Systems-on-Chips (SoCs) of strong Physical Unclonable Functions (PUFs) that can generate a vast number of Challenge-Response Pairs (CRPs) for cryptographic key generation, identification, and authentication. However, the excessive area cost of strong PUF designs imposes integration difficulties on SoCs for area-constrained applications, such as the IoT and mobile computing. In this paper, we present a novel strong PUF design with silicon area requirements significantly lower than those of previous strong PUFs. The proposed Current-mode Array-type Strong PUF (CAS-PUF) is based on a current source topology of only six minimum-size transistors, which is tolerant to power supply variation for enhanced reliability. Compared to previous strong PUFs, the CAS-PUF achieves the same number of CRPs with 20% to 72% less area; for the same area, it provides a 19 to 53 orders of magnitude higher number of CRPs. Furthermore, extensive Monte Carlo simulations on the CAS-PUF show a reliability of 96.45% under ±10% power supply fluctuation and 97.69% under temperature variation (0°C to 80°C), with an average uniqueness and uniformity of 50.01% and 49.54%, respectively. Therefore, the CAS-PUF can be used as a hardware root-of-trust mechanism to secure computing in area-constrained SoCs.
11:35 CEST TS22.8 FLASH: AN EFFICIENT HARDWARE ACCELERATOR LEVERAGING APPROXIMATE AND SPARSE FFT FOR HOMOMORPHIC ENCRYPTION
Speaker:
Tengyu Zhang, Peking University, CN
Authors:
Tengyu Zhang1, Yufei Xue2, Ling Liang1, Zhen Gu3, Yuan Wang1, Runsheng Wang1, Ru Huang1 and Meng Li1
1Peking University, CN; 2Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, HK; 3Alibaba Group, CN
Abstract
Private convolutional neural network (CNN) inference based on hybrid homomorphic encryption (HE) and two-party computation (2PC) is emerging as a promising technique for protecting sensitive user data. However, homomorphic convolutions (HConvs) suffer from high computation costs due to the extensive number theoretic transforms (NTTs). While customized accelerators have been proposed, they usually overlook the intrinsic error resilience and native sparsity of DNNs and hybrid HE/2PC protocols. In this paper, we propose FLASH, which leverages these key characteristics for highly efficient HConv. Specifically, we observe that private DNN inference is robust to computation errors and propose approximate fast Fourier transforms (FFTs) to replace NTTs and avoid expensive modular reduction operations. We also design a flexible sparse FFT dataflow leveraging the high sparsity of weight plaintexts. With extensive experiments, we demonstrate that FLASH improves power efficiency by 90.7x for weight transforms and by 9.7x for all transforms in HConvs compared to existing works. For the HConvs in ResNet-18 and ResNet-50, FLASH achieves about 87.3% energy consumption reduction.
11:40 CEST TS22.9 HFL: HARDWARE FUZZING LOOP WITH REINFORCEMENT LEARNING
Speaker:
Lichao Wu, TU Darmstadt, DE
Authors:
Lichao Wu, Mohamadreza Rostami, Huimin Li and Ahmad-Reza Sadeghi, TU Darmstadt, DE
Abstract
As hardware systems grow increasingly complex, ensuring their security becomes more critical. This complexity often introduces difficult and costly vulnerabilities to address after fabrication. Traditional verification methods, such as formal and dynamic approaches, encounter limitations in scalability and efficiency when applied to complex hardware designs. While hardware fuzzing presents a promising solution for efficient and effective vulnerability detection, current methods face several challenges, including coverage saturation, long simulation times, and limited vulnerability detection capabilities. This paper introduces Hardware Fuzzing Loop (HFL), a novel fuzzing framework designed to address these limitations. We demonstrate that Long Short-Term Memory (LSTM), a machine learning model commonly used in natural language processing, can effectively capture the semantics of test cases and accurately predict hardware coverage. Building on this insight, we leverage reinforcement learning to optimize the test generation strategy dynamically within a hardware fuzzing loop. Our approach utilizes a multi-head LSTM to generate sophisticated RISC-V assembly instruction sequences, along with an LSTM-based predictor that evaluates the quality of these instructions. By dynamically interacting with the hardware, HFL efficiently explores complex instruction sequences with minimal fuzzing iterations, allowing it to uncover hard-to-detect vulnerabilities. We evaluated HFL on three RISC-V cores, and the results show that it achieves higher coverage using fewer than 1% of the test cases required by leading hardware fuzzers, effectively mitigating the issue of coverage saturation. Furthermore, HFL identified all known vulnerabilities in the tested systems and discovered four previously unknown high-severity issues, demonstrating its significant potential in improving hardware security assessments.
11:45 CEST TS22.10 REAP-NVM: RESILIENT ENDURANCE-AWARE NVM-BASED PUF AGAINST LEARNING-BASED ATTACKS
Speaker:
Hassan Nassar, Karlsruhe Institute of Technology, DE
Authors:
Hassan Nassar1, Ming-Liang Wei2, Chia-Lin Yang2, Joerg Henkel1 and Kuan-Hsun Chen3
1Karlsruhe Institute of Technology, DE; 2National Taiwan University, TW; 3University of Twente, NL
Abstract
NVM-based PUFs offer secure authentication and cryptographic applications by exploiting NVMs' multi-level cell (MLC) capability to generate diverse, ML-attack-resistant responses. Yet, frequent writes degrade these PUFs, lowering reliability and lifespan. This paper presents a model to assess endurance effects on NVM PUFs, guiding the creation of more robust PUFs. Our novel NVM PUF design enhances endurance by evenly distributing writes, thus mitigating cell stress and achieving a 62x improvement over current solutions while preserving security against learning-based attacks.
11:46 CEST TS22.11 ACCELERATING OBLIVIOUS TRANSFER WITH A PIPELINED ARCHITECTURE
Speaker:
Xiaolin Li, Institute of Computing Technology, Chinese Academy of Sciences, CN
Authors:
Xiaolin Li1, Wei Yan1, Yong Zhang2, Hongwei Liu1, Qinfen Hao1, Yong Liu2 and Ninghui Sun1
1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2Zhongguancun Laboratory, CN
Abstract
With the rapid development of machine learning and big data technologies, ensuring user privacy has become a pressing challenge. Secure multi-party computation offers a solution to this challenge by enabling privacy-preserving computations, but it also incurs significant performance overhead, thus limiting its further application. Our analysis reveals that the oblivious transfer protocol accounts for up to 96.64% of execution time. To address these challenges, we propose POTA, a high-performance pipelined OT hardware acceleration architecture supporting the silent OT protocol. Finally, we implement a POTA prototype on Xilinx VCU129 FPGAs. Experimental results demonstrate that under various network settings, POTA achieves significant speedups, with maximum improvements of 22.67× for OT efficiency and 192.57× for basic operations in MPC applications.

TS23 Reconfigurable systems

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS23.1 FPGA-BASED ACCELERATION OF MCMC ALGORITHM THROUGH SELF-SHRINKING FOR BIG DATA
Speaker:
Shuanglong Liu, Hunan Normal University, CN
Authors:
Shuanglong Liu, Shiyu Peng and Wan Shen, Hunan Normal University, CN
Abstract
Markov chain Monte Carlo (MCMC) algorithms are widely used in Bayesian inference to compute the posterior distribution of complex models, facilitating sampling from probability distributions. However, the computational burden of evaluating the likelihood function in MCMC poses significant challenges in big data applications. To address this, sub-sampling methods have been introduced to approximate the target distribution by using subsets of the data rather than the entire dataset. Unfortunately, these methods often lead to biased samples, making them impractical for real-world applications. This paper proposes a novel scaling MCMC method that achieves exact sampling by utilizing a subset (mini-batch) of the data with locally bounded approximations of the target distribution. Our method adaptively adjusts the mini-batch size by automatically tuning a hyperparameter based on the sample acceptance ratio, ensuring optimal balance between sample efficiency and computational cost. Moreover, we introduce a highly optimized hardware architecture to efficiently implement the proposed MCMC method onto FPGA. Our accelerator is evaluated on an AMD Zynq UltraScale+ FPGA device using a Bayesian logistic regression model on the MNIST dataset. The results demonstrate that our design achieves unbiased sampling with a 47.6 times speedup over the standard MCMC design, while also significantly reducing estimation errors compared to state-of-the-art MCMC methods.
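A minimal, deliberately naive sketch of the control knob the paper tunes, not its exact-sampling scheme: a subsampled Metropolis-Hastings step whose mini-batch size is adapted from the running acceptance ratio. The Gaussian model, the adaptation rule, and all constants are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    data = rng.normal(2.0, 1.0, 100_000)           # large dataset with unknown mean
    theta, step, batch = 0.0, 0.05, 512
    accepts, target_acc = 0, 0.4

    def loglik(mu, sample):
        return -0.5 * np.sum((sample - mu) ** 2)   # Gaussian likelihood, unit variance

    for it in range(1, 2001):
        idx = rng.integers(0, data.size, batch)
        sample = data[idx]
        scale = data.size / batch                  # rescale the subsampled likelihood
        prop = theta + rng.normal(0.0, step)
        log_alpha = scale * (loglik(prop, sample) - loglik(theta, sample))
        if np.log(rng.random()) < log_alpha:
            theta, accepts = prop, accepts + 1
        if it % 200 == 0:                          # adapt the mini-batch size
            acc = accepts / 200
            # grow the batch when acceptance drifts from the target (noisy estimate),
            # shrink it again when acceptance is back on target
            if abs(acc - target_acc) > 0.1:
                batch = min(data.size, batch * 2)
            else:
                batch = max(64, batch // 2)
            accepts = 0

    print("posterior mean estimate:", round(theta, 3), "final mini-batch:", batch)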
11:05 CEST TS23.2 ATE-GCN: AN FPGA-BASED GRAPH CONVOLUTIONAL NETWORK ACCELERATOR WITH ASYMMETRICAL TERNARY QUANTIZATION
Speaker:
Ruiqi Chen, Vrije Universiteit Brussel, BE
Authors:
Ruiqi Chen1, Jiayu Liu2, Shidi Tang3, Yang Liu4, Yanxiang Zhu5, Ming Ling3 and Bruno da Silva1
1Vrije Universiteit Brussel, BE; 2University College London, GB; 3Southeast University, CN; 4Fudan University, CN; 5VeriMake Innovation Laboratory, CN
Abstract
Ternary quantization can effectively simplify matrix multiplication, which is the primary computational operation in neural network models. It has shown success in FPGA-based accelerator designs for emerging models such as GAT and Transformer. However, existing ternary quantization methods can lead to substantial accuracy loss under certain weight distribution patterns, such as GCN. Furthermore, current FPGA-based ternary weight designs often focus on reducing resource consumption while neglecting full utilization of FPGA DSP blocks, limiting maximum performance. To address these challenges, we propose ATE-GCN, an FPGA-based asymmetrical ternary quantization GCN accelerator using a software-hardware co-optimization approach. First, we adopt an asymmetrical quantization strategy with specific interval divisions tailored to the bimodal distribution of GCN weights, reducing accuracy loss. Second, we design a unified processing element (PE) array on FPGA to support various matrix computation forms, optimizing FPGA resource usage while leveraging the benefits of cascade design and ternary quantization, significantly boosting performance. Finally, we implement the ATE-GCN prototype on the VCU118 FPGA board. The results show that ATE-GCN maintains an accuracy loss below 2%. Additionally, ATE-GCN achieves average performance improvements of 224.13× and 11.1×, with up to 898.82× and 69.9× energy consumption saving compared to CPU and GPU, respectively. Moreover, compared to state-of-the-art FPGA-based GCN accelerators, ATE-GCN improves DSP efficiency by 63% with an average latency reduction of 11%.
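A minimal sketch of asymmetric ternary quantization, with illustrative thresholds rather than ATE-GCN's calibrated interval divisions: different negative and positive cut points map a bimodal weight distribution to {-1, 0, +1}.

    import numpy as np

    def ternary_asymmetric(w, t_neg, t_pos):
        # asymmetric cut points fit a bimodal weight distribution better than
        # a single symmetric threshold
        q = np.zeros_like(w, dtype=np.int8)
        q[w <= t_neg] = -1
        q[w >= t_pos] = +1
        return q

    rng = np.random.default_rng(0)
    # toy bimodal weights, loosely mimicking the GCN pattern the abstract mentions
    w = np.concatenate([rng.normal(-0.6, 0.1, 500), rng.normal(0.2, 0.05, 500)])
    q = ternary_asymmetric(w, t_neg=-0.3, t_pos=0.1)
    print("fraction of -1/0/+1:", [(q == v).mean() for v in (-1, 0, 1)])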
11:10 CEST TS23.3 PREVV: ELIMINATING STORE QUEUE VIA PREMATURE VALUE VALIDATION FOR DATAFLOW CIRCUIT ON FPGA
Speaker:
Kuangjie Zou, Fudan University, CN
Authors:
Kuangjie Zou, Yifan Zhang, Zicheng Zhang, Guoyu Li, Jianli Chen, Kun Wang and Jun Yu, Fudan University, CN
Abstract
Dynamic scheduling in high-level synthesis (HLS) maximizes pipeline performance by enabling out-of-order scheduling of load and store requests at runtime. However, this method introduces unpredictable memory dependencies, leading to data disambiguation challenges. Load-store queues (LSQs), commonly used in superscalar CPUs, offer a potential solution for HLS. However, LSQs in dynamically scheduled HLS implementations often suffer from high resource overhead and scalability limitations. In this paper, we introduce PreVV, an architecture based on premature value validation designed to address memory disambiguation with minimal resource overhead. Our approach substitutes LSQ with several PreVV components and a straightforward premature queue. We prevent potential deadlocks by incorporating a specific tag that can send 'fake' tokens to prevent the accumulation of outdated data. Furthermore, we demonstrate that our design has scalability potential. We implement our design using several hardware templates and an LLVM pass to generate targeted dataflow circuits with PreVV. Experimental results on various benchmarks with data hazards show that, compared to state-of-the-art dynamic HLS, PreVV16 (a version with a premature queue depth of 16) reduces LUT usage by 43.91% and FF usage by 33.09%, with minimal impact on timing performance. Meanwhile, PreVV64 (a version with a premature queue depth of 64) reduces LUT usage by 27.21% and FF usage by 33.10%, without affecting timing performance.
11:15 CEST TS23.4 PEARL: FPGA-BASED REINFORCEMENT LEARNING ACCELERATION WITH PIPELINED PARALLEL ENVIRONMENTS
Speaker:
Jiayi Li, Peking University, CN
Authors:
Jiayi Li, Hongxiao Zhao, Wenshuo Yue, Yihan Fu, Daijing Shi, Anjunyi Fan, Yuchao Yang and Bonan Yan, Peking University, CN
Abstract
Reinforcement learning (RL) is an effective machine learning approach that enables artificial intelligence agents to perform complex tasks and make decisions in dynamic situations. Training an RL agent demands its repetitive interaction with the environment to learn optimal policies. To efficiently collect training data, parallelizing environments is a widely used technique by enabling simultaneous interactions between multiple agents and environments. However, existing CPU-based RL software frameworks face a key challenge of slow multi-environmental update computation. To solve this problem, we present a novel FPGA-based RL accelerating framework--PEARL. PEARL instantiates multiple parallel environments and accelerates them with a carefully designed pipeline scheme to hide data transfer latency within the computation time. We evaluate PEARL on representative RL environments and achieve 4.36× to 972.6× speedup over the existing fastest software-based framework for parallel environment execution. When scaling the number of environments from 1024 to 43008 (42×) in CliffWalking benchmark, the power consumption increases marginally by 3%, while LUT and flip-flops utilization rise by 2.24× and 3.08×, respectively. This demonstrates efficient resource usage and power management in PEARL. Further, PEARL allows users to define and add their environments within the framework flexibly. We have established an open-source repository for users to utilize and expand. We also implement PEARL with the existing RL algorithm and achieve acceleration. It is available online at https://github.com/Selinaee/FPGA_Gym.
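A minimal software sketch of the parallel-environment idea, not PEARL's FPGA pipeline: many independent toy grid environments are stepped in one vectorized update, with automatic reset of finished instances. The environment dynamics and sizes are assumptions.

    import numpy as np

    class VecGridEnv:
        def __init__(self, n_envs, size=12):
            self.size = size
            self.pos = np.zeros(n_envs, dtype=np.int64)

        def step(self, actions):                 # actions in {-1, +1}
            self.pos = np.clip(self.pos + actions, 0, self.size - 1)
            done = self.pos == self.size - 1
            reward = np.where(done, 1.0, -0.01)
            self.pos[done] = 0                   # auto-reset finished environments
            return self.pos.copy(), reward, done

    env = VecGridEnv(n_envs=1024)
    rng = np.random.default_rng(0)
    obs, rew, done = env.step(rng.choice([-1, 1], size=1024))
    print(obs.shape, float(rew.mean()), int(done.sum()))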
11:20 CEST TS23.5 AISPGEMM: ACCELERATING IMBALANCED SPGEMM ON FPGAS WITH FLEXIBLE INTERCONNECT AND INTRA-ROW PARALLEL MERGING
Speaker:
Yuanfang Wang, Fudan University, CN
Authors:
Enhao Tang1, Shun Li2, Hao Zhou3, Guohao Dai3, Jun Lin4 and Kun Wang1
1Fudan University, CN; 2Southeast University, CN; 3Shanghai Jiao Tong University, CN; 4Nanjing University, CN
Abstract
The row-wise product algorithm shows significant potential for sparse matrix-matrix multiplication (SpGEMM) on hardware accelerators. Recent studies have made notable progress in accelerating SpGEMM using this algorithm. However, several challenges remain in accelerating imbalanced SpGEMM, where the distribution of non-zero elements across different rows is imbalanced. These challenges include: (1) the fixed dataflow of the merger tree, which leads to lower PE utilization, and (2) highly imbalanced data distributions, such as single rows with numerous non-zero elements, which result in intensive computations. This imbalance significantly challenges SpGEMM acceleration, leading to time-consuming processes that dominate overall computation time. In this paper, we propose AiSpGEMM to accelerate imbalanced SpGEMM on FPGAs. First, we improved the C2SR format to adapt it for imbalanced SpGEMM acceleration based on the row-wise product algorithm. This reduces off-chip memory bank conflicts and increases data reuse of matrix B. Secondly, we design a reconfigurable merger (R-merger) with flexible interconnects to improve PE utilization. Additionally, we propose an intra-row parallel merging algorithm and its corresponding hardware architecture, the parallel merger (P-merger), to accelerate intensive operations. Experimental results demonstrate that AiSpGEMM achieves a geometric mean (geomean) speedup of 5.8× compared to the state-of-the-art FPGA-based SpGEMM accelerator. In Geomean, AiSpGEMM achieves a 3.0× speedup and a 9.8× improvement in energy efficiency compared to the NVIDIA cuSPARSE library running on an NVIDIA A6000 GPU. Moreover, AiSpGEMM-21 demonstrated a 4× increase in average throughput compared to the same GPU.
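A minimal sketch of the row-wise product formulation that AiSpGEMM accelerates: each nonzero A[i,k] scales row k of B, and the partial rows are merged into output row i, which is the step the R-merger and P-merger parallelize in hardware. The CSR-like dict-of-rows representation is an illustrative assumption.

    def spgemm_rowwise(A_rows, B_rows):
        # A_rows / B_rows: one {column: value} dict per row (CSR-like)
        C_rows = []
        for a_row in A_rows:
            acc = {}
            for k, a_val in a_row.items():       # nonzeros of row i of A
                for j, b_val in B_rows[k].items():
                    acc[j] = acc.get(j, 0.0) + a_val * b_val   # intra-row merge
            C_rows.append(acc)
        return C_rows

    A = [{0: 2.0, 2: 1.0}, {1: 3.0}]
    B = [{0: 1.0}, {2: 4.0}, {0: 5.0, 1: 1.0}]
    print(spgemm_rowwise(A, B))   # [{0: 7.0, 1: 1.0}, {2: 12.0}]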
11:25 CEST TS23.6 FAMERS: AN FPGA ACCELERATOR FOR MEMORY-EFFICIENT EDGE-RENDERED 3D GAUSSIAN SPLATTING
Speaker:
Yuanfang Wang, Fudan University, CN
Authors:
Yuanfang Wang, Yu Li, Jianli Chen, Jun Yu and Kun Wang, Fudan University, CN
Abstract
This paper introduces FAMERS, a tile-based hardware accelerator designed for efficient 3D Gaussian Splatting (3DGS) inference on edge-deployed Field Programmable Gate Arrays (FPGAs). 3DGS has emerged as a powerful technique for photorealistic image rendering, leveraging anisotropic Gaussians to balance computational efficiency and visual fidelity. However, the high memory and processing demands of 3DGS pose significant challenges for real-time applications on resource-constrained edge devices. To address these limitations, we present a novel architecture that optimizes both computational and memory overheads through model pruning and compression techniques, enabling high-quality rendering within the constrained memory and processing capabilities of edge platforms. Experimental results demonstrate that our implementation on the Xilinx XC7K325T FPGA achieves a 1.99× speedup and 13.46× energy efficiency compared to NVIDIA RTX 3060M Laptop GPU, underscoring the viability of our approach for real-time applications in virtual and augmented reality.
11:30 CEST TS23.7 SMARTMAP: ARCHITECTURE-AGNOSTIC CGRA MAPPING USING GRAPH TRAVERSAL AND REINFORCEMENT LEARNING
Speaker:
Ricardo Ferreira, Federal University of Viçosa, BR
Authors:
Fábio Ramos1, Pedro Realino1, Wagner Junior1, Alex Vieira2, Ricardo Ferreira1 and José Nacif1
1Federal University of Viçosa, BR; 2Federal University of Juiz de Fora, BR
Abstract
Coarse-Grained Reconfigurable Architectures (CGRAs) have been the subject of extensive research due to their balance between performance, energy efficiency, and flexibility. CGRAs must be capable of executing a dataflow graph (DFG), which depends on a compiler producing quality valid mappings with feasible running time performance and portable mapping DFGs on different CGRA architectures. Machine learning-based compilers have shown promising results by presenting high quality and performance but offer limited portability. Moreover, some approaches do not explore efficient placement methods or do not demonstrate whether they scale to more challenging, less connected architectures. This paper presents SmartMap, an architecture-agnostic framework that uses an actor-critic reinforcement learning method applied to a Monte-Carlo Tree Search (MCTS) to learn how to map a DFG onto a CGRA. This framework offers full portability using a state-action representation layer in the policy network instead of a probability distribution over actions. SmartMap uses a graph traversal placement method to provide scalability and improve the efficiency of MCTS by enabling more efficient exploration during the search. Our results show that SmartMap has 2.81x more mapping capacity, a 16.82x speed-up in compilation time, and consumes fewer resources compared to the state-of-the-art.
11:35 CEST TS23.8 DATAFLOW OPTIMIZED RECONFIGURABLE ACCELERATION FOR FEM-BASED CFD SIMULATIONS
Speaker:
Aggelos Ferikoglou, National TU Athens, GR
Authors:
Anastassis Kapetanakis, Aggelos Ferikoglou, Georgios Anagnostopoulos and Sotirios Xydis, National TU Athens, GR
Abstract
Computational Fluid Dynamics (CFD) simulations are essential for analyzing and optimizing fluid flows in a wide range of real-world applications. These simulations involve approximating the solutions of the Navier-Stokes differential equations using numerical methods, which are highly compute- and memory-intensive due to their need for high-precision iterations. In this work, we introduce a high-performance FPGA accelerator specifically designed for numerically solving the Navier-Stokes equations. We focus on the Finite Element Method (FEM) due to its ability to accurately model complex geometries and intricate setups typical of real-world applications. Our accelerator is implemented using High-Level Synthesis (HLS) on an AMD Alveo U200 FPGA, leveraging the reconfigurability of FPGAs to offer a flexible and adaptable solution. The proposed solution achieves 7.9x higher performance than optimized Vitis-HLS implementations and 45% lower latency with 3.64x less power compared to a software implementation on a high-end server CPU. This highlights the potential of our approach to solve Navier-Stokes equations more effectively, paving the way for tackling even more challenging CFD simulations in the future.
11:40 CEST TS23.9 A RESOURCE-AWARE RESIDUAL-BASED GAUSSIAN BELIEF PROPAGATION ACCELERATOR TOOLFLOW
Speaker:
Omar Sharif, Imperial College London, GB
Authors:
Omar Sharif and Christos Bouganis, Imperial College London, GB
Abstract
Gaussian Belief Propagation (GBP) is a graphical method of statistical inference that provides an approximate solution to the probability distribution of a system. In recent years, GBP has emerged as a powerful computational framework with numerous applications in domains such as SLAM and image processing. In pursuit of high performance efficiency (i.e., inference per watt), streaming-based reconfigurable hardware solutions have demonstrated significant performance gains compared to leading-edge processors and high-power, server-grade CPUs. However, this class of architectures suffers from performance degradation at scale when on-chip memory is limited. This paper addresses this challenge by building on previous GBP architectural and algorithmic developments, introducing a novel hardware method that dynamically prioritizes node computations by monitoring information gain. By leveraging the inherent properties of the GBP algorithm, we demonstrate how convergence-driven optimizations can push the performance envelope of state-of-the-art reconfigurable accelerators despite on-chip memory constraints. The performance of our architecture is rigorously evaluated against these prior accelerators across both real-world and synthetic SLAM and image-denoising benchmarks. For equal resources, our work achieves a convergence rate improvement of up to 3.5x for large graphs, demonstrating its effectiveness for real-time inference tasks.
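A minimal sketch of residual-driven scheduling, the software analogue of the information-gain-based node prioritization the abstract describes (the actual GBP message math is omitted): the node whose state would still change the most is popped from a priority queue first. The toy chain relaxation below is an assumption.

    import heapq

    # toy chain "graph": each node relaxes toward the mean of its neighbours;
    # the residual is how far it still is from that mean
    values = [1.0, 5.0, -3.0, 10.0]

    def neighbours(i):
        return [j for j in (i - 1, i + 1) if 0 <= j < len(values)]

    def residual(i):
        nbrs = [values[j] for j in neighbours(i)]
        return abs(sum(nbrs) / len(nbrs) - values[i])

    def update(i):
        nbrs = [values[j] for j in neighbours(i)]
        values[i] = sum(nbrs) / len(nbrs)

    heap = [(-residual(i), i) for i in range(len(values))]
    heapq.heapify(heap)
    for _ in range(200):
        neg_r, i = heapq.heappop(heap)
        if -neg_r < 1e-6:
            break                            # largest remaining residual is negligible
        update(i)
        for j in [i] + neighbours(i):        # refresh priorities of the touched region
            heapq.heappush(heap, (-residual(j), j))
    print([round(v, 3) for v in values])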
11:45 CEST TS23.10 UNIT: A HIGHLY UNIFIED AND MEMORY-EFFICIENT FPGA-BASED ACCELERATOR FOR TORUS FHE
Speaker:
Yuying ZHANG, The Hong Kong University of Science and Technology, HK
Authors:
Yuying ZHANG1, Sharad Sinha2, Jiang Xu3 and Wei Zhang1
1The Hong Kong University of Science and Technology, HK; 2Indian Institute of Technology (IIT) Goa, IN; 3The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Fully Homomorphic Encryption (FHE) has emerged as a promising solution for the secure computation on encrypted data without leaking user privacy. Among various FHE schemes, Torus FHE (TFHE) distinguishes itself by its ability to perform exact computations on non-linear functions within the encrypted domain, satisfying the crucial requirement for privacy-preserving AI applications. However, the high computational overhead and strong data dependency in TFHE's bootstrapping process present significant challenges to its practical adoption and efficient hardware implementation. Existing TFHE accelerators on various hardware platforms still face limitations in terms of performance, flexibility, and area efficiency. In this work, we propose UNIT, a novel and highly unified accelerator for Programmable Bootstrapping (PBS) in TFHE, featuring carefully designed computation units. We introduce a unified architecture for negacyclic (inverse) number theoretic transform (I)NTT with fused twisting steps, which reduces computing resources by 33% and the memory utilization of pre-stored factors by nearly 66%. Another key feature of UNIT is the innovative design of the monomial number theoretic transform unit, called OF-MNTT, which leverages on-the-fly twiddle factor generation to eliminate memory traffic and overhead. This memory-efficient and highly parallelizable approach for MNTT is proposed for the first time in TFHE acceleration. Furthermore, UNIT is highly reconfigurable and scalable, supporting various parameter sets and performance-resource requirements. Our proposed accelerator is evaluated on the Xilinx Alveo U250 FPGA platform. Experimental results demonstrate its superior performance compared to the state-of-the-art GPU and FPGA-based implementations with improvements of 8.3x and 3.63x, respectively. In comparison with the most advanced FPGA implementation, UNIT achieves 30% enhanced area efficiency and 3.2x reduced power with much better flexibility.
11:50 CEST TS23.11 RGHT-Q: RECONFIGURABLE GEMM UNIT FOR HETEROGENEOUS-HOMOGENEOUS TENSOR QUANTIZATION
Speaker:
Seungho Lee, Sungkyunkwan University, KR
Authors:
Seungho Lee, Donghyun Nam and Jeongwoo Park, Sungkyunkwan University, KR
Abstract
The high computational demands of large language models (LLMs) are limited by the lack of GPU hardware support for heterogeneous quantization, which mixes integers and floating points. To address this limitation, we propose an LLM processing element (PE), RGHT-Q, which features reconfigurable general-matrix multiplication (GEMM) for both heterogeneous and homogeneous tensor quantization. The RGHT-Q introduces a novel design that leverages butterfly routing and multi-precision multipliers. As a result, we achieve significant performance improvements, offering 3.14× higher energy efficiency, and 1.56× better area efficiency compared to prior designs.

LK03 Special Day Emerging Computing Paradigms Lunchtime Keynote

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 13:15 CEST - 14:00 CEST

Time Label Presentation Title
Authors
13:15 CEST LK03.1 SPECIAL DAY EMERGING COMPUTING PARADIGM LUNCHTIME KEYNOTE
Presenter:
Christian Mayr, TU Dresden, DE
Author:
Christian Mayr, TU Dresden, DE
Abstract
.

FS04 Focus Session - Designing Secure Space Systems

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST

Session chair:
Sebastian Steinhorst, TU Munich, DE

Session co-chair:
Daniel Lüdtke, German Aerospace Center (DLR), DE

Organisers:
Sebastian Steinhorst, TU Munich, DE
Michael Felderer, German Aerospace Center (DLR), DE

As the scope of space exploration expands, the need for robust cybersecurity measures has become more urgent than ever. Nowadays, private companies are entering the space sector, leading to a big increase in satellite launches and space activities. While this expansion reduces launch costs, it also elevates the risk of cyber threats. Historically, cybersecurity in space has been overlooked, leaving critical vulnerabilities exposed. This hot-topic session will bring together four experts from industry, government, and research to tackle the critical challenges and explore innovative solutions in building a secure space ecosystem.

Time Label Presentation Title
Authors
14:00 CEST FS04.1 OFFENSIVE SECURITY TESTING FOR SPACE SYSTEMS
Presenter:
Milenko Starcik, VisionSpace Technologies GmbH, DE
Author:
Milenko Starcik, VisionSpace Technologies GmbH, DE
Abstract
Space missions, especially commercial space systems, are targeted by state-backed Advanced Persistent Threat (APT) actors since they increasingly share capacity between government and private users. The attacks often exploit legacy hardware, software, and outdated protocols. Legacy system vulnerabilities and the effects of the COVID-19 pandemic have further exposed space systems to potential exploitation. Recent incidents, such as the attack on satellite terminals in the 2022 ViaSat case with its widespread impact, show how legacy systems can lead to a security breach. While the space systems community has a strong safety and test engineering history, security validation is often neglected. Our security research on currently used space protocols, mission control software, and spacecraft onboard software frameworks shows that security measures are still not applied throughout the space mission life cycle.
14:22 CEST FS04.2 LOCKING YOUR DOOR DOES NOT MAKE YOU SECURE AT YOUR HOME, SIMILARLY YOUR SATELLITE!
Presenter:
Zain Hammadeh, German Aerospace Center (DLR), DE
Author:
Zain Hammadeh, German Aerospace Center (DLR), DE
Abstract
Securing the link between the ground segment and the satellite is essential to protect the satellite from cyber-attacks. Solutions including end-to-end encryption can help avoid attacks like spoofing and replay attacks. However, developers of on-board software should not assume that a satellite environment is secure, especially in an era where a satellite will serve as an execution service for 3rd party software, which can be malicious. Efficient intrusion detection systems (IDS) are essential for monitoring network traffic and system behavior to identify malicious activities in real-time. Additionally, an effective intrusion response mechanism must be in place to ensure that the satellite can continue functioning even under attack. This requires a fail-operational mode that guarantees essential systems remain operational while isolating and neutralizing compromised components. Given the constraints on computational resources in space systems, these security solutions must be optimized for low-latency response and minimal resource consumption, all while ensuring high reliability and resilience against evolving cyber threats.
14:45 CEST FS04.3 SECURITY ENGINEERING (NOT JUST) FOR SPACE
Presenter:
Stefan Langhamme, OHB Digital Connect GmbH, DE
Author:
Stefan Langhamme, OHB Digital Connect GmbH, DE
Abstract
"s space exploration advances and the commercialization of space technologies grows, the security of space assets has become a critical concern. In a related trend the use of "off the shelf" hard- and software facilitates the commercial use of space, but also creates new attack surfaces. This creates a need for off the shelf solutions for security risks. And while a lot of very good solutions exist, experience shows that adding "security" to a system does not automatically lead to an increase in security. This was just recently demonstrated by the global IT outage caused by the CrowdStrike security software. What is needed is the integration of cybersecurity into the engineering lifecycle. In this talk we will investigate ways in which the diverse field of cybersecurity - ranging from organisational and management questions to deeply technical topics – can be integrated into the engineering lifecycle of space systems. The underlying aim is improving the security stance of the system without adding new problems or unnecessary complexity. Key areas covered include threat modelling, risk assessment, secure software and hardware design, encryption, and response strategies. Our aim is to deepen the listeners understanding of what security is, how to achieve it and how to learn from mistakes made in "non-space" IT systems."
15:07 CEST FS04.4 A JOINT EFFORT: STANDARDIZATION OF CYBERSECURITY IN SPACE
Presenter:
Florian Göhler, Germany's Federal Office for Information Security (BSI), DE
Author:
Florian Göhler, Germany's Federal Office for Information Security (BSI), DE
Abstract
Cybersecurity should be an integrated part of every space mission, and security aspects need to be considered throughout all phases of a project. However, there is a lack of universally applicable security standards that address cyberthreats in space, as existing security standards often miss security measures against space-specific threats. Especially small institutions, start-ups, and research facilities suffer from this lack of guidance, but the issue is also pressing for established industry stakeholders. To overcome this situation, the German Federal Office for Information Security founded an expert group for cybersecurity in space that invites experts from governmental institutions, industry, and academics to work together on standardization and regulation. In a joint effort and based on existing standards, the expert group developed multiple documents that aim to mitigate cyberthreats on space and ground segments. These guidelines focus on every life cycle phase of a space mission, and they are adaptable to the scope and complexity of any given project. Furthermore, the expert group aims to identify emerging new technologies and regulations that may impact cybersecurity in space. These efforts also take international developments into account.

HSD03 HackTheSilicon DATE

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 18:00 CEST


LKS04 Later … with the keynote speakers

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:00 CEST


TS24 Logical analysis and design

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS24.1 HYBRID EXACT AND HEURISTIC EFFICIENT TRANSISTOR NETWORK OPTIMIZATION FOR MULTI-OUTPUT LOGIC
Speaker:
Lang Feng, Sun Yat-sen University, CN
Authors:
Lang Feng1, Rongjian Liang2 and Hongxin Kong3
1Sun Yat-sen University, CN; 2NVIDIA Corp., US; 3Texas A&M University, US
Abstract
With the approaching post-Moore era, it is becoming increasingly impractical to decrease the transistor size in digital VLSI for better performance. To address this issue, one approach is to optimize the digital circuit at the transistor level to reduce the transistor count. Although previous works have explored ways to conduct transistor network optimization, most of these efforts have focused on single-output networks or applied heuristics only, limiting their scope or optimization quality. In this paper, we propose an exact transistor network optimization algorithm that supports multi-output logic and is formulated as a SAT problem. Our approach maintains a high optimization level by employing the exact algorithm, while also incorporating a hybrid process that uses a heuristic algorithm to predict the solution range as a guidance for better efficiency. Experimental results show that the proposed algorithm has a 5.32% better optimization level with 54% less runtime compared with the state-of-the-art work.
14:05 CEST TS24.2 MAXIMUM FANOUT-FREE WINDOW ENUMERATION: TOWARDS MULTI-OUTPUT SUB-STRUCTURE SYNTHESIS
Speaker:
Ruofei TANG, Hong Kong Baptist University, CN
Authors:
Ruofei Tang1, Xuliang Zhu2, Xing Li3, Lei Chen4, Xin Huang1, Mingxuan Yuan4 and Jianliang Xu5
1Hong Kong Baptist University, HK; 2Antai College of Economics and Management, Shanghai Jiaotong University, CN; 3Huawei Noah's Ark Lab, CN; 4Huawei Noah's Ark Lab, HK; 5Hong Kong Baptist University, HK
Abstract
Peephole optimization is commonly used in And-Inverter Graphs (AIGs) optimization algorithms. The efficiency of these algorithms heavily relies on the enumeration process of sub-structures. One common sub-structure is the cut, known for its efficient enumeration method and single-output characteristic. However, an increasing number of optimization algorithms now target sub-structures that incorporate multiple outputs. In this paper, we explore Maximum Fanout-Free Windows (MFFWs), a novel sub-structure with a multi-output nature, as well as its practical applications and enumeration algorithms. To accommodate various algorithm execution processes, we propose two different enumeration styles: Dynamic and Static. The Dynamic approach provides flexibility in adapting to changes in the AIG structure, whereas the Static method ensures efficiency as long as the AIG structure remains unchanged during execution. We apply these methods to rewriting and technology mapping to improve their runtime performance. Experimental results on pure enumeration and practical scenarios show the scalability and efficiency of the proposed MFFW enumeration methods.
14:10 CEST TS24.3 SIMGEN: SIMULATION PATTERN GENERATION FOR EFFICIENT EQUIVALENCE CHECKING
Speaker:
Carmine Rizzi, ETH Zurich, CH
Authors:
Carmine Rizzi1, Sarah Brunner1, Alan Mishchenko2 and Lana Josipovic1
1ETH Zurich, CH; 2University of California, Berkeley, US
Abstract
Combinational equivalence checking for hardware design tends to be slow due to the number and complexity of intermediate node equivalences considered by the SAT solver. This is because the solver often spends extensive time disproving nodes that appear equivalent under random simulation. We propose SimGen, an open-source and expressive simulation pattern generator inspired by Automatic Test Pattern Generation (ATPG); it exploits the circuit's structure and logic information to disprove the equivalence of circuit nodes and avoid excessive SAT calls. We demonstrate the effectiveness of SimGen's simulation patterns over those generated by state-of-the-art random and guided simulation.
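A minimal sketch of the baseline that SimGen improves on, namely plain random simulation used to disprove candidate node equivalences cheaply before the SAT solver is invoked: nodes whose bit-parallel signatures differ on some pattern cannot be equivalent. The tiny node functions below are assumptions standing in for real AIG nodes.

    import random
    random.seed(0)

    def signatures(n_inputs, n_patterns, node_fns):
        # node_fns: functions over a tuple of input bits -> node value (0/1);
        # each node accumulates one signature bit per simulated pattern
        sigs = [0] * len(node_fns)
        for _ in range(n_patterns):
            inputs = tuple(random.getrandbits(1) for _ in range(n_inputs))
            for k, fn in enumerate(node_fns):
                sigs[k] = (sigs[k] << 1) | fn(inputs)
        return sigs

    # three candidate nodes; the first two are truly equivalent, the third is not
    nodes = [lambda x: x[0] & x[1],
             lambda x: ~(~x[0] | ~x[1]) & 1,
             lambda x: x[0] | x[1]]
    sigs = signatures(2, 64, nodes)
    print("still candidate-equivalent:", sigs[0] == sigs[1],
          "disproved without SAT:", sigs[0] != sigs[2])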
14:15 CEST TS24.4 ELMAP: AREA-DRIVEN LUT MAPPING WITH K-LUT NETWORK EXACT SYNTHESIS
Speaker:
Hongyang Pan, Fudan University, CN
Authors:
Hongyang Pan1, Keren Zhu1, Fan Yang1, Zhufei Chu2 and Xuan Zeng1
1Fudan University, CN; 2Ningbo University, CN
Abstract
Mapping to k-input lookup tables (k-LUTs) is a critical process in field-programmable gate array (FPGA) synthesis. However, the structure of the subject graph can introduce structural bias, which refers to the dependency of mapping results on the inherent graph structure, often leading to suboptimal results. To address this, we present ELMap, an area-driven LUT mapping framework. It incorporates structural choice during the collapsing phase. This enables dynamic decomposition, maximizing local-to-global optimization transfer. To ensure seamless integration between the optimization and mapping processes, ELMap leverages exact k-LUT synthesis to generate area-optimal sub-LUT networks. Experiments on the EPFL benchmark suite demonstrate that ELMap significantly outperforms state-of-the-art methods. Specifically, in 6-LUT mapping, ELMap reduces the average LUT area by 8.5% and improves the area-depth product (ADP) by 5.8%. In 4-LUT remapping, it reduces the average LUT area by 17.6% and improves the ADP by 2.4%.
14:20 CEST TS24.5 APPLICATION OF FORMAL METHODS (SAT/SMT) TO THE DESIGN OF CONSTRAINED CODES
Speaker:
Sunil Sudhakaran, Student, US
Authors:
Sunil Sudhakaran1, Clark Barrett2 and Mark Horowitz2
1Student, US; 2Stanford University, US
Abstract
Constrained coding plays a crucial role in high-speed communication links by restricting bit sequences to reduce the adverse effects imposed by the characteristics of the channel. This technique trades off some bit efficiency for higher transmission rates, thereby boosting overall data throughput. We show how the design of hardware-efficient translation logic to and from the restricted code space can be formulated as a Satisfiability Modulo Theories (SMT) problem. Using SMT, we can not only try to minimize the complexity of this logic and limit the effect of transmission errors on the final decoded output, but also significantly reduce development time—from weeks to just hours. Our initial results demonstrate the efficiency and effectiveness of this approach.
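A minimal sketch of casting a constrained-code feasibility question as an SMT problem, using the Z3 Python API (the z3-solver package is an assumed dependency); the run-length-style constraint and the 3-bit-to-4-bit block size are toy stand-ins for the paper's channel constraints, and encoder logic cost is not modelled.

    from z3 import BitVec, Extract, Solver, Distinct, Not, And, sat

    code = [BitVec(f"c{i}", 4) for i in range(8)]    # one 4-bit codeword per 3-bit datum
    s = Solver()
    s.add(Distinct(*code))                            # decodability: codewords are unique
    for c in code:
        bits = [Extract(j, j, c) for j in range(4)]
        for j in range(2):                            # forbid runs of three identical bits
            s.add(Not(And(bits[j] == bits[j + 1], bits[j + 1] == bits[j + 2])))

    if s.check() == sat:
        m = s.model()
        print([m[c].as_long() for c in code])         # one feasible encoder table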
14:25 CEST TS24.6 WIDEGATE: BEYOND DIRECTED ACYCLIC GRAPH LEARNING IN SUBCIRCUIT BOUNDARY PREDICTION
Speaker:
Jiawei Liu, Beijing University of Posts and Telecommunications, CN
Authors:
Jiawei Liu1, Zhiyan Liu1, Xun He1, Jianwang Zhai1, Zhengyuan Shi2, Qiang Xu2, Bei Yu2 and Chuan Shi1
1Beijing University of Posts and Telecommunications, CN; 2The Chinese University of Hong Kong, HK
Abstract
Subcircuit boundary prediction is an important application of machine learning in logical analysis, effectively supporting tasks such as functional verification and logic optimization. Existing methods often convert circuits into and-inverter graphs and then use directed acyclic graph neural networks to perform this task. However, two key characteristics of subcircuit boundary prediction do not align with the fundamental assumptions of DAG learning, which limits the model's expressiveness and generalization capabilities. To break these assumptions, we propose WideGate, which includes a receptive field generation module that extends beyond the fanin cone and fanout cone, as well as an adaptive aggregation module that focuses on boundaries. Extensive experiments show that WideGate significantly outperforms existing methods in terms of prediction accuracy and training efficiency for subcircuit boundary prediction. The code is available at https://github.com/BUPT-GAMMA/WideGate.
14:30 CEST TS24.7 BIAS BY DESIGN: DIVERSITY QUANTIFICATION TO MITIGATE STRUCTURAL BIAS EFFECTS IN AIG LOGIC OPTIMIZATION
Speaker:
Isabella Venancia Gardner, Universiteit van Amsterdam, NL
Authors:
Isabella Venancia Gardner1, Marcel Walter2, Yukio Miyasaka3, Robert Wille2 and Michael Cochez4
1Universiteit van Amsterdam, NL; 2TU Munich, DE; 3University of California, Berkeley, US; 4Vrije Universiteit Amsterdam, NL
Abstract
And-Inverter Graphs (AIGs) are a fundamental data structure in logic optimization and are widely used in modern electronic design automation. A persistent challenge in AIG optimization is structural bias, where the initial graph structure significantly influences optimization quality by restricting the search space, often resulting in suboptimal outcomes. Existing methods address this issue by running multiple optimization workflows in parallel, relying on a trial-and-error approach that lacks a systematic way to measure structural diversity or assess effectiveness, making them computationally expensive and inefficient. This paper introduces a novel framework for systematically evaluating and reducing structural bias by measuring structural diversity, defined as the degree of dissimilarity between AIG graphs. Several traditional graph similarity measures and newly proposed AIG-specific metrics, including the Rewrite, Refactor, and Resub Scores, are explored. Results reveal limitations in traditional graph similarity metrics and highlight the effectiveness of the proposed AIG-specific measures in quantifying structural dissimilarity. Notably, the RRR Score shows a strong correlation (Pearson correlation coefficient, r = 0.79) with post-optimization structural differences, demonstrating the reliability of the metric in capturing meaningful variations between AIG structures. This work addresses the challenge of quantifying structural bias and offers a methodology that could potentially improve optimization outcomes, with future extensions applicable to other types of logic graphs.
14:35 CEST TS24.8 TIMING-DRIVEN APPROXIMATE LOGIC SYNTHESIS BASED ON DOUBLE-CHASE GREY WOLF OPTIMIZER
Speaker:
Xiangfei Hu, Southeast University, CN
Authors:
Xiangfei Hu1, Yuyang Ye2, Tinghuan Chen3, Hao Yan1 and Bei Yu2
1Southeast University, CN; 2The Chinese University of Hong Kong, HK; 3The Chinese University of Hong Kong, Shenzhen, CN
Abstract
With the shrinking technology nodes, timing optimization becomes increasingly challenging. Approximate logic synthesis (ALS) can perform local approximate changes (LACs) on circuits to optimize timing at the cost of slight inaccuracy. However, existing ALS methods that focus solely on critical path depth reduction or area minimization are not optimal in timing optimization. This paper proposes an effective timing-driven ALS framework, where we employ a double-chase grey wolf optimizer to explore and apply LACs, simultaneously bringing excellent critical path shortening and area reduction under error constraints. Subsequently, it utilizes post-optimization under area constraints to convert area reduction into further timing improvement, thus achieving maximum critical path delay reduction. According to experiments on open-source circuits with 28nm technology, compared to the SOTA method, our framework can generate approximate circuits with greater critical path delay reduction under different error and area constraints.
14:40 CEST TS24.9 IRW: AN INTELLIGENT REWRITING
Speaker:
Haisheng Zheng, Shanghai Artificial Intelligence Laboratory, CN
Authors:
Haisheng Zheng1, Haoyuan WU2, Zhuolun He2, Yuzhe Ma3 and Bei Yu2
1Shanghai AI Laboratory, CN; 2The Chinese University of Hong Kong, HK; 3The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
This paper proposes a novel machine learning-driven rewriting algorithm to optimize And-Inverter Graphs (AIGs) for refining combinational logic prior to technology mapping. The algorithm, called iRw, iteratively extracts subcircuits in AIGs and replaces them with more streamlined implementations. These subcircuits are identified using an original extraction algorithm, while the compact implementations are produced through rewriting techniques guided by a machine learning model. This approach efficiently enables the generation of logically equivalent subcircuits with minimal overhead. Experiments on benchmark circuits indicate that the proposed methodology outperforms state-of-the-art AIG rewriting techniques in both quality and runtime.
14:41 CEST TS24.10 AUTOMATIC ROUTING FOR PHOTONIC INTEGRATED CIRCUITS UNDER DELAY MATCHING CONSTRAINTS
Speaker:
Yuchao Wu, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Yuchao Wu1, Weilong Guan1, Yeyu Tong2 and Yuzhe Ma1
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Optical interconnects have emerged as a promising solution for rack-, board-scale, and even in-package communications, thanks to their high available optical bandwidth and minimal latency. However, the optical waveguides are intrinsically different from traditional metal wires, especially the phase matching constraints, which impose new challenges for routing in the photonic integrated circuits design. In this paper, we propose a comprehensive and efficient optical routing framework that introduces a diffuse-based length-matching method and bend modification methods to ensure phase-matching constraints. Furthermore, we present a congestion-based A* formulation with a negotiated congestion-based rip-up and reroute strategy on new rectangular grids with an aspect ratio of 1:√3 to reduce insertion loss. Experimental results based on real photonic integrated designs show that our optical routing flow can reduce total insertion loss by 11% and maximum insertion loss by 108%, while effectively satisfying matching constraints, compared to manual results.
14:42 CEST TS24.11 ML-BASED AIG TIMING PREDICTION TO ENHANCE LOGIC OPTIMIZATION
Speaker:
Sachin Sapatnekar, University of Minnesota, US
Authors:
Wenjing Jiang1, Jin Yan2 and Sachin S. Sapatnekar1
1University of Minnesota, US; 2Google, US
Abstract
Traditional logic optimization relies on proxy metrics to approximate post-mapping performance and area, which may not correlate well with post-mapping delay and area. This paper explores a ground-truth-based optimization flow that directly incorporates the post-mapping delay and area during optimization using decision tree-based machine learning models. Results show high prediction accuracy and generalization to unseen designs.
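A minimal sketch of the ground-truth modelling step, with synthetic features in place of real AIG statistics and scikit-learn as an assumed dependency: a decision-tree regressor learns to predict a post-mapping delay proxy that the optimization flow can query instead of a structural metric.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    # toy features: [node count, logic depth, mean fanout]; toy ground-truth delay
    X = rng.uniform([1e3, 10, 1.5], [1e5, 80, 4.0], size=(500, 3))
    y = 0.02 * X[:, 1] + 1e-5 * X[:, 0] + rng.normal(0, 0.05, 500)

    model = DecisionTreeRegressor(max_depth=6).fit(X[:400], y[:400])
    print("held-out MAE:", float(np.abs(model.predict(X[400:]) - y[400:]).mean()))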

TS25 Design and Test for Machine Learning and Machine Learning for Design and Test

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS25.1 HYATTEN: HYBRID PHOTONIC-DIGITAL ARCHITECTURE FOR ACCELERATING ATTENTION MECHANISM
Speaker:
Huize Li, National University of Singapore, SG
Authors:
Huize Li, Dan Chen and Tulika Mitra, National University of Singapore, SG
Abstract
The wide adoption and substantial computational resource requirements of attention-based Transformers have spurred the demand for efficient hardware accelerators. Unlike digital-based accelerators, there is growing interest in exploring photonics due to its high energy efficiency and ultra-fast processing speeds. However, the significant signal conversion overhead limits the performance of photonic-based accelerators. In this work, we propose HyAtten, a photonic-based attention accelerator with minimized signal conversion overhead. HyAtten incorporates a signal comparator to classify signals into two categories based on whether they can be processed by low-resolution converters. HyAtten integrates low-resolution converters to process all low-resolution signals, thereby boosting the parallelism of photonic computing. For signals requiring high-resolution conversion, HyAtten uses digital circuits instead of signal converters to reduce area and latency overhead. Compared to the state-of-the-art photonic-based Transformer accelerator, HyAtten achieves 9.8× performance/area and 2.2× energy-efficiency/area improvement.
14:05 CEST TS25.2 SEGA-DCIM: DESIGN SPACE EXPLORATION-GUIDED AUTOMATIC DIGITAL CIM COMPILER WITH MULTIPLE PRECISION SUPPORT
Speaker:
Haikang Diao, Peking University, CN
Authors:
Haikang Diao, Haoyi Zhang, Jiahao Song, Haoyang Luo, Yibo Lin, Runsheng Wang, Yuan Wang and Xiyuan Tang, Peking University, CN
Abstract
Digital computing-in-memory (DCIM) has been a popular solution for addressing the memory wall problem in recent years. However, the DCIM design still heavily relies on manual efforts, and the optimization of DCIM is often based on human experience. These disadvantages lengthen the time to market while increasing the design difficulty of DCIMs. This work proposes a design space exploration-guided automatic DCIM compiler (SEGA-DCIM) with multiple precision support, including integer and floating-point data precision operations. SEGA-DCIM can automatically generate netlists and layouts of DCIM designs by leveraging a template-based method. With a multi-objective genetic algorithm (MOGA)-based design space explorer, SEGA-DCIM can easily select appropriate DCIM designs for a specific application considering the trade-offs among area, power, and delay. As demonstrated by the experimental results, SEGA-DCIM offers solutions with wide design space, including integer and floating-point precision designs, while maintaining competitive performance compared to state-of-the-art (SOTA) DCIMs.
14:10 CEST TS25.3 SOFTMAP: SOFTWARE-HARDWARE CO-DESIGN FOR INTEGER-ONLY SOFTMAX ON ASSOCIATIVE PROCESSORS
Speaker:
Mariam Rakka, University of California, Irvine, US
Authors:
Mariam Rakka1, Jinhao Li2, Guohao Dai3, Ahmed Eltawil4, Mohammed Fouda5 and Fadi Kurdahi1
1University of California, Irvine, US; 2Shanghai Jiao Tong University, CN; 3Qingyuan Research Institute, Shanghai Jiao Tong University, CN; 4King Abdullah University of Science and Technology, SA; 5Rain AI, US
Abstract
Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance.
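A minimal sketch of an integer-only softmax in the spirit of the abstract, not SoftmAP's kernel: exp() is replaced by a power-of-two approximation so only shifts, adds and one integer division remain. The fixed-point scale, the linear interpolation, and the assumption that logits stay in a small range are all illustrative choices.

    import numpy as np

    def int_softmax(q_logits, scale_log2=4):
        # q_logits: integer logits in fixed point with 2**scale_log2 steps per unit
        z = q_logits - q_logits.max()                 # <= 0, keeps the "exp" in (0, 1]
        shift = (-z) >> scale_log2                    # integer part of the exponent
        frac = (-z) & ((1 << scale_log2) - 1)         # fractional part
        # linear interpolation of 2**(-frac/scale) between 1.0 and 0.5, in Q8
        p = (256 - (frac << 7 >> scale_log2)) >> shift
        return (p * 256) // max(int(p.sum()), 1)      # normalized Q8 probabilities

    q = np.array([48, 32, 16, 0], dtype=np.int64)     # toy quantized logits
    print(int_softmax(q))                             # integer probabilities summing to ~256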
14:15 CEST TS25.4 COMPREHENSIVE RISC-V FLOATING-POINT VERIFICATION: EFFICIENT COVERAGE MODELS AND CONSTRAINT-BASED TEST GENERATION
Speaker:
Tianyao Lu, College of Information Science and Electronic Engineering, Zhejiang University, CN
Authors:
Tianyao Lu, Anlin Liu, Bingjie Xia and Peng Liu, Zhejiang University, CN
Abstract
The increasing complexity of processor architectures necessitates more rigorous functional verification. Floating-point operations, in particular, present significant challenges due to their extensive range of computational cases that require verification. This paper proposes a comprehensive approach for generating floating-point instruction sequences to enhance the verification of RISC-V. We introduce a constraint-based method for floating-point test generation and design efficient coverage models as input constraints for this process. The resulting representative floating-point tests are integrated with RISC-V instruction sequence generation through a memory-bound register update method. Experimental results demonstrate that our approach improves the functional coverage of RISC-V floating-point instruction sequences from 93.32% to 98.34%, while simultaneously reducing the number of required instructions by 66.67% compared to the Google RISCV-DV generator. Additionally, our method achieves more comprehensive coverage of floating-point types in instruction write-back data compared to RISCV-DV. Using the proposed approach, we successfully detect representative floating-point-related faults injected into the RISC-V processor CV32E40P, thereby demonstrating its effectiveness.
14:20 CEST TS25.5 WINACC: WINDOW-BASED ACCELERATION OF NEURAL NETWORKS USING BLOCK FLOATING POINT
Speaker:
Xin Ju, National University of Defense Technology, CN
Authors:
Xin Ju, Jun He, Mei Wen, Jing Feng, Yasong Cao, Junzhong Shen, Zhaoyun Chen and Yang Shi, National University of Defense Technology, CN
Abstract
Deep Neural Networks (DNNs) impose significant computational demands, necessitating optimizations for computational and energy efficiencies. Per-vector scaling, which applies a scaling factor to blocks of elements using narrow integer types, effectively reduces storage and computational overhead. However, the frequent occurrence of floating-point accumulations between vectors limits further improvements in energy efficiency. State-of-the-art accelerators address this challenge by grouping and summing vector products based on their exponent differences, thereby reducing the overhead associated with intra-group shifting and accumulation. Nevertheless, this approach increases the complexity of register usage and grouping logic, leading to limited energy benefits and hardware efficiency. In this context, we introduce WinAcc, a novel algorithm and architecture co-designed solution that utilizes a low-cost accumulator to handle the majority of data in DNNs, offering low area overhead and high energy efficiency gains. Our key insight is that the data of DNNs follows a Laplace-like distribution, which enables the use of a customized data format with a narrow dynamic range to encode most of the data. This allows for the design of a low-cost accumulator with narrow shifters and adders, significantly reducing reliance on floating-point accumulator and consequently improving energy efficiency. Compared with state-of-the-art architecture Bucket, WinAcc achieves 33.95% energy reduction across seven representative DNNs and reduces area by 9.5% while maintaining superior model performance.
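A minimal sketch of per-vector block floating point, the format family WinAcc builds on: a block of values shares one exponent and stores narrow integer mantissas, so a dot product reduces to integer MACs plus a single scale. The mantissa width and block size are assumptions.

    import numpy as np

    def to_bfp(vec, mant_bits=8):
        # one shared exponent per block, narrow integer mantissas per element
        shared_exp = int(np.ceil(np.log2(np.abs(vec).max() + 1e-12)))
        scale_exp = shared_exp - (mant_bits - 1)
        mant = np.clip(np.round(vec / 2.0 ** scale_exp),
                       -(1 << (mant_bits - 1)), (1 << (mant_bits - 1)) - 1).astype(np.int32)
        return mant, scale_exp

    def bfp_dot(m_a, e_a, m_b, e_b):
        # integer multiply-accumulate, then one scale by the shared exponents
        return int(np.dot(m_a.astype(np.int64), m_b.astype(np.int64))) * 2.0 ** (e_a + e_b)

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=16), rng.normal(size=16)
    (ma, ea), (mb, eb) = to_bfp(a), to_bfp(b)
    print("exact:", float(a @ b), " block floating point:", bfp_dot(ma, ea, mb, eb))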
14:25 CEST TS25.6 SACPLACE: MULTI-AGENT DEEP REINFORCEMENT LEARNING FOR SYMMETRY-AWARE ANALOG CIRCUIT PLACEMENT
Speaker:
Lei Cai, Wuhan University of Technology, CN
Authors:
Lei Cai1, Guojing Ge2, Guibo Zhu2, Jixin Zhang3, Jinqiao Wang2, Bowen Jia1 and Ning Xu1
1Wuhan University of Technology, CN; 2Institute of Automation, Chinese Academy of Sciences, CN; 3Hubei University of Technology, CN
Abstract
The placement of analog Integrated Circuits (ICs) plays a critical role in their physical design. The objective is to minimize the Half-Perimeter Wire Length (HPWL) while satisfying complex analog IC constraints, such as symmetry. Unlike digital ICs, analog ICs are highly sensitive to parasitic effects, making device symmetry crucial for optimal circuit performance. However, existing methods, including both machine learning-based and analytical approaches, struggle to meet strict symmetry constraints. In machine learning-based methods, training a general model is challenging due to the limited diversity of the training data. In analytical methods, the difficulty lies in formulating symmetry constraints as a convex function, which is necessary for gradient-based optimization of the placement. To address the issue, we formulate the placement process as a Markov decision process and propose SACPlace, a multi-agent deep reinforcement learning method for Symmetry-Aware analog Circuit Placement. SACPlace initially extracts layout information and various constraints as the input information for placement refinement and evaluation. Subsequently, SACPlace constructs multi-agent policy networks for symmetry-aware placement by refining placement guided by the evaluation of optimal symmetry quality. Following this, SACPlace constructs multi-layer perceptron-based critic networks to embed placement information for evaluating symmetry quality. This evaluation reward will be used for guiding placement refinement. Experimental results from four public analog IC datasets demonstrate that our method achieves the lowest HPWL while fully satisfying symmetry and common constraints, outperforming state-of-the-art methods. Additionally, simulation results on real-world analog ICs show better performance than these methods and even manual designs.
14:30 CEST TS25.7 LINEARIZATION OF QUADRATURE DIGITAL POWER AMPLIFIERS BY NEURAL NETWORK OF ULR_LSTM: UNSUPERVISED LEARNING RESIDUAL LSTM
Speaker:
Jiayu Yang, State Key Laboratory of Integrated Chips and Systems, School of Microelectronics, Fudan University, Shanghai, China, CN
Authors:
Jiayu Yang, Luyi Guo, Yicheng Li, Wang Wang, Zixu Li, Manni Li, Zijian Huang, Yinyin Lin, Yun Yin and Hongtao Xu, Fudan University, CN
Abstract
For the first time, this paper presents an unsupervised learning residual long short-term memory (ULR_LSTM) neural network to develop a digital predistortion (DPD) method for the linearization of digital power amplifiers (DPAs). Our method eliminates the need for iterative learning control (ILC) to obtain the ideal input of the DPA required by state-of-the-arts (SOTAs), which leads to high computational complexity and extensive training time. We perform behavioral modeling of the DPA using the R_LSTM network. After determining the optimal behavioral model architecture, the corresponding DPD model is obtained through an inverse training process. A 15-bit transformer-based quadrature DPA chip incorporating Class-G and IQ-cell-sharing techniques was implemented in a 28nm CMOS process to validate our proposed method. Experimental results demonstrate outstanding linearization performance compared to prior art, achieving an error vector magnitude (EVM) of -40.4dB for the 802.11ax 40MHz 64QAM signal.
14:35 CEST TS25.8 COMPATIBILITY GRAPH ASSISTED AUTOMATIC HARDWARE TROJAN INSERTION FRAMEWORK
Speaker:
Anjum Riaz, IIT Jammu, IN
Authors:
Gaurav Kumar, Ashfaq Shaik, Anjum Riaz, Yamuna Prasad and Satyadev Ahlawat, IIT Jammu, IN
Abstract
Hardware Trojans (HTs) pose substantial security threats to Integrated Circuits (ICs), compromising their integrity, confidentiality, and functionality. Various HT detection methods have been developed to mitigate these risks. However, the limited availability of comprehensive HT benchmarks necessitates designers to create their own for evaluation purposes. Moreover, the existing benchmarks exhibit several deficiencies, including a restricted range of trigger nodes, susceptibility to detection through random patterns, lengthy HT instance creation and validation process, and a limited number of HT instances per circuit. To address these limitations, we propose a Compatibility Graph assisted automatic Hardware Trojan insertion framework for HT benchmark generation. Given a netlist, this framework generates a design incorporating single or multiple HT instances according to user-defined properties. It allows various configurations of HTs, such as a large number of trigger nodes, low activation probability and large number of unique HT instances. The experimental results demonstrate that the generated HT benchmarks exhibit exceptional resistance to state-of-the-art HT detection schemes. Additionally, the proposed framework achieves an average improvement of 37815.7x and 989.4x over the insertion times of the Random and Reinforcement Learning based HT insertion frameworks, respectively.
14:40 CEST TS25.9 TOWARDS ROBUST RRAM-BASED VISION TRANSFORMER MODELS WITH NOISE-AWARE KNOWLEDGE DISTILLATION
Speaker:
Wenyong Zhou, The University of Hong Kong, HK
Authors:
Wenyong Zhou, Taiqiang Wu, Chenchen Ding, Yuan Ren, Zhengwu Liu and Ngai Wong, The University of Hong Kong, HK
Abstract
Resistive random-access memory (RRAM)-based compute-in-memory (CIM) systems show promise in accelerating Transformer-based vision models but face challenges from inherent device non-idealities. In this work, we systematically investigate the vulnerability of Transformer-based vision models to RRAM-induced perturbations. Our analysis reveals that earlier Transformer layers are more vulnerable than later ones, and feed-forward networks (FFNs) are more susceptible to noise than multi-head self-attention (MHSA). Based on these observations, we propose a noise-aware knowledge distillation framework that enhances model robustness by aligning both intermediate features and final outputs between weight-perturbed and noise-free models. Experimental results demonstrate that our method improves accuracy by up to 1.54% and 1.49% on ViT and DeiT models under various noise conditions compared to their vanilla counterparts.
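A minimal PyTorch sketch of the distillation objective described in the abstract, with a toy two-layer block standing in for a ViT/DeiT model and multiplicative Gaussian noise standing in for RRAM non-idealities (both assumptions): the loss aligns an intermediate feature and the output distribution of the weight-perturbed copy with its noise-free teacher.

    import torch
    import torch.nn as nn

    class TinyBlock(nn.Module):
        def __init__(self, d=32):
            super().__init__()
            self.ffn = nn.Sequential(nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))
            self.head = nn.Linear(d, 10)
        def forward(self, x):
            feat = self.ffn(x)                  # intermediate feature to be aligned
            return feat, self.head(feat)

    teacher, student = TinyBlock(), TinyBlock()
    student.load_state_dict(teacher.state_dict())
    with torch.no_grad():                       # simulate conductance (weight) noise
        for p in student.parameters():
            p.add_(0.05 * p.abs() * torch.randn_like(p))

    x = torch.randn(8, 32)
    with torch.no_grad():
        t_feat, t_out = teacher(x)
    s_feat, s_out = student(x)
    loss = nn.functional.mse_loss(s_feat, t_feat) + \
           nn.functional.kl_div(s_out.log_softmax(-1), t_out.softmax(-1), reduction="batchmean")
    loss.backward()                             # gradients drive noise-aware fine-tuning
    print(float(loss))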
14:41 CEST TS25.10 HYIMC: ANALOG-DIGITAL HYBRID IN-MEMORY COMPUTING SOC FOR HIGH-QUALITY LOW-LATENCY SPEECH ENHANCEMENT
Speaker:
Wanru Mao, Beihang University, CN
Authors:
Wanru Mao1, Hanjie Liu1, Guangyao Wang1, Tianshuo Bai1, Jingcheng Gu1, Han Zhang1, Xitong Yang2, Aifei Zhang2, Xiaohang Wei2, Meng Wang2 and Wang Kang1
1Beihang University, CN; 2Zhicun Research Lab, CN
Abstract
In-memory computing (IMC) holds significant promise for accelerating deep learning-based speech enhancement (DL-SE). However, existing IMC architectures face challenges in simultaneously achieving high precision, energy efficiency, and the necessary parallelism for DL-SE's inherent temporal dependencies. This paper introduces HyIMC, a novel hybrid analog-digital IMC architecture designed to address these limitations. HyIMC features: 1) a hybrid analog-digital design optimized for DL-SE algorithms; 2) a schedule controller that efficiently manages recurrent dataflow within skip connections; and 3) non-key dimension shrinkage, a model compression technique that preserves accuracy. Implemented on a 40nm eFlash-based IMC SoC prototype, HyIMC achieves 160 TOPS/W energy efficiency, compresses the DL-SE model size by ∼600%, improves the figure of merit by ∼1200%, and enhances perceptual evaluation of speech quality by ∼120%.

TS26 Design and test for analog and mixed-signal circuits / systems / MEMS

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS26.1 INTO-OA: INTERPRETABLE TOPOLOGY OPTIMIZATION FOR OPERATIONAL AMPLIFIERS
Speaker:
Jinyi Shen, Fudan University, CN
Authors:
Jinyi Shen, Fan Yang, Li Shang, Zhaori Bi, Changhao Yan, Dian Zhou and Xuan Zeng, Fudan University, CN
Abstract
This paper presents INTO-OA, an interpretable topology optimization method for operational amplifiers (op-amps). We propose a Bayesian optimization-based approach to effectively explore the high-dimensional, discrete topology design space of op-amps. Our method integrates a Gaussian process surrogate model with the Weisfeiler-Lehman graph kernel to extract structural features from a dedicated circuit graph representation. It also employs a candidate generation strategy that combines random sampling with mutation to balance global exploration and local exploitation. Additionally, INTO-OA enhances interpretability by assessing the impact of circuit structures on performance, providing designers with valuable insights into generated topologies and enabling the interpretable refinement of existing designs. Experimental results demonstrate that INTO-OA achieves higher success rates, a 1.84× to 19.10× improvement in op-amp performance, and a 3.20× to 14.33× increase in topology optimization efficiency compared to state-of-the-art methods.
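To make the Weisfeiler-Lehman ingredient concrete, here is a minimal, library-free sketch of WL relabeling on a small labeled device graph and a histogram-dot-product similarity between two graphs. The node labels ('nmos', 'pmos', 'cap') and the two tiny graphs are hypothetical; in INTO-OA the WL kernel is used inside a Gaussian process surrogate rather than stand-alone.

```python
from collections import Counter

def wl_features(adjacency, labels, iterations=2):
    """
    Weisfeiler-Lehman subtree features of one labeled graph.
    adjacency: dict node -> list of neighbour nodes; labels: dict node -> initial label.
    Returns a Counter over (iteration, label) pairs.
    """
    feats = Counter((0, lbl) for lbl in labels.values())
    current = dict(labels)
    for it in range(1, iterations + 1):
        relabeled = {}
        for node, nbrs in adjacency.items():
            # New label = own label plus the sorted multiset of neighbour labels.
            relabeled[node] = (current[node], tuple(sorted(str(current[n]) for n in nbrs)))
            feats[(it, relabeled[node])] += 1
        current = relabeled
    return feats

def wl_similarity(f1, f2):
    """Dot product of two WL feature histograms (an unnormalized graph-kernel value)."""
    return sum(count * f2.get(key, 0) for key, count in f1.items())

# Two tiny, hypothetical device graphs standing in for op-amp topology fragments.
adj_a = {"m1": ["m2"], "m2": ["m1", "c1"], "c1": ["m2"]}
lab_a = {"m1": "nmos", "m2": "pmos", "c1": "cap"}
adj_b = {"m1": ["m2"], "m2": ["m1"]}
lab_b = {"m1": "nmos", "m2": "pmos"}
print(wl_similarity(wl_features(adj_a, lab_a), wl_features(adj_b, lab_b)))
```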
14:05 CEST TS26.2 EFFECTIVE ANALOG ICS FLOORPLANNING WITH RELATIONAL GRAPH NEURAL NETWORKS AND REINFORCEMENT LEARNING
Speaker:
Davide Basso, University of Trieste, IT
Authors:
Davide Basso1, Luca Bortolussi1, Mirjana Videnovic-Misic2 and Husni Habal3
1University of Trieste, IT; 2Infineon Technologies AT, AT; 3Infineon Technologies, DE
Abstract
Analog integrated circuit (IC) floorplanning is typically a manual process with the placement of components (devices and modules) planned by a layout engineer. This process is further complicated by the interdependence of floorplanning and routing steps, numerous electric and layout-dependent constraints, as well as the high level of customization expected in analog design. This paper presents a novel automatic floorplanning algorithm based on reinforcement learning. It is augmented by a relational graph convolutional neural network model for encoding circuit features and positional constraints. The combination of these two machine learning methods enables knowledge transfer across different circuit designs with distinct topologies and constraints, increasing the generalization ability of the solution. Applied to 6 industrial circuits, our approach surpassed established floorplanning techniques in terms of speed, area and half-perimeter wire length. When integrated into a procedural generator for layout completion, overall layout time was reduced by 67.3% with an 8.3% mean area reduction compared to manual layout.
14:10 CEST TS26.3 FORMALLY VERIFYING ANALOG NEURAL NETWORKS WITH DEVICE MISMATCH VARIATIONS
Speaker:
Tobias Ladner, TU Munich, DE
Authors:
Yasmine Abu-Haeyeh1, Thomas Bartelsmeier2, Tobias Ladner3, Matthias Althoff3, Lars Hedrich4 and Markus Olbrich2
1University of Frankfurt, DE; 2Leibniz University Hannover, DE; 3TU Munich, DE; 4Goethe University Frankfurt, DE
Abstract
Training and running inference of large neural networks come with excessive cost and power consumption. Thus, realizing these networks as analog circuits is an energy- and area-efficient alternative. However, analog neural networks suffer from inherent deviations within their circuits, requiring extensive testing for their correct behavior under these deviations. Unfortunately, tests based on Monte Carlo simulations are extremely time- and resource-intensive. We present an alternative approach that proves the correctness of the neural network using formal neural network verification techniques, together with a modeling methodology for these analog neural circuits. Our experimental results compare two methods based on reachability analysis, showing their effectiveness by reducing the test time from days to milliseconds. Thus, they offer a faster, more scalable solution for verifying the correctness of analog neural circuits.
14:15 CEST TS26.4 POST-LAYOUT AUTOMATED OPTIMIZATION FOR CAPACITOR ARRAY IN DIGITAL-TO-TIME CONVERTER
Speaker:
Hefei Wang, Southern University of Science and Technology, CN
Authors:
Hefei Wang1, Jianghao Su1, Junhe Xue1, Haoran Lv1, Junhua Zhang2, Longyang Lin1, Kai Chen1, Lijuan Yang2 and Shenghua Zhou3
1Southern University of Science and Technology, CN; 2International Quantum Academy, CN; 3Southern University of Science and Technology; International Quantum Academy, CN
Abstract
The integral non-linearity (INL) of the Digital-to-Time Converter (DTC) in fractional-N phase-locked loops introduces fractional spurs, especially at near-integer channels, resulting in increased jitter. To meet the strict jitter and spur performance requirements of high-performance wireless transceivers, minimizing the INL in DTC designs is crucial. This work presents a computer-aided, automated optimization methodology that focuses on addressing issues stemming from the uniform capacitor unit structure within the capacitor array in a Variable-Slope DTC. These issues include parasitic resistance and capacitance, which distort the charging and discharging behavior of the capacitors, contributing to INL. By systematically optimizing the capacitor layout and mitigating parasitic effects, the methodology allows precise tuning of each capacitor unit in the capacitor array to reduce INL, enhancing the overall performance of the DTC.
14:20 CEST TS26.5 TIME-DOMAIN 3D ELECTROMAGNETIC FIELDS ESTIMATION BASED ON PHYSICS-INFORMED DEEP LEARNING FRAMEWORK
Speaker:
Huifan Zhang, ShanghaiTech University, CN
Authors:
Huifan Zhang, Yun Hu and Pingqiang Zhou, ShanghaiTech University, CN
Abstract
Electromagnetic simulation is important and time-consuming in RF/microwave circuit design. Physics-informed deep learning is a promising method to learn a family of parametric partial differential equations. In this work, we propose a physics-informed deep learning framework to estimate time-domain 3D electromagnetic fields. Our method leverages physics-informed loss functions to model Maxwell's equations which govern electromagnetic fields. Our post-trained model produces accurate results with over 200x speedup over the FDTD simulation. We reduce the mean square error by at least 14% and 15%, with respect to purely data-driven learning and the Fourier operator learning method FNO. In order to optimize data and physical loss simultaneously, we introduce a self-adaptive scaling factors updating algorithm, which has 8.4% less error than the loss balancing method ReLoBRaLo.
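A physics-informed loss of this kind can be illustrated on the normalized 1D form of Maxwell's curl equations: the sketch below evaluates the PDE residual of predicted E and H fields with finite differences and adds it to a data-fitting term. The grid, normalization, and weighting factor lam are illustrative assumptions and are unrelated to the paper's 3D formulation or its adaptive scaling algorithm.

```python
import numpy as np

def maxwell_residual_1d(E, H, dx, dt):
    """
    Physics residuals of the normalized 1D Maxwell curl equations
        dE/dt + dH/dx = 0,   dH/dt + dE/dx = 0
    evaluated with central finite differences on predicted fields.
    E, H: arrays of shape (T, X) holding fields over time and space.
    Returns the mean squared residual, i.e. a 'physics loss' term.
    """
    dE_dt = (E[2:, 1:-1] - E[:-2, 1:-1]) / (2 * dt)
    dH_dx = (H[1:-1, 2:] - H[1:-1, :-2]) / (2 * dx)
    dH_dt = (H[2:, 1:-1] - H[:-2, 1:-1]) / (2 * dt)
    dE_dx = (E[1:-1, 2:] - E[1:-1, :-2]) / (2 * dx)
    r1 = dE_dt + dH_dx
    r2 = dH_dt + dE_dx
    return float(np.mean(r1 ** 2) + np.mean(r2 ** 2))

def total_loss(pred_E, pred_H, data_E, data_H, dx, dt, lam=1.0):
    """Weighted sum of a data term and the physics term; lam is a hand-set scale factor."""
    data_term = float(np.mean((pred_E - data_E) ** 2) + np.mean((pred_H - data_H) ** 2))
    return data_term + lam * maxwell_residual_1d(pred_E, pred_H, dx, dt)

# Toy usage on random "predictions" and "reference" fields of shape (time, space).
rng = np.random.default_rng(0)
E = rng.standard_normal((32, 64)); H = rng.standard_normal((32, 64))
print(total_loss(E, H, E, H, dx=1e-3, dt=1e-3))
```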
14:25 CEST TS26.6 TPC-GAN: BATCH TOPOLOGY SYNTHESIS FOR PERFORMANCE-COMPLIANT OPERATIONAL AMPLIFIERS USING GENERATIVE ADVERSARIAL NETWORKS
Speaker:
Jinglin Han, Beihang University, CN
Authors:
Yuhao Leng1, Jinglin Han1, Yining Wang2 and Peng Wang1
1Beihang University, CN; 2Corelink Technology (Qingdao) Co., Ltd., CN
Abstract
The operational amplifier is one of the most important analog building blocks. Existing automated synthesis strategies for operational amplifiers focus solely on the optimization of a single topology, making them unsuitable for scenarios requiring batch synthesis, such as dataset augmentation. In this paper, we introduce TPC-GAN, a generative model for batch topology synthesis of operational amplifiers in accordance with performance specifications. Specifically, it incorporates a reward network for circuit performance into the generative adversarial network (GAN). This enables direct synthesis of novel and feasible circuit topologies meeting performance specifications. Experimental results demonstrate that our proposed method can achieve a validity rate of 98% in circuit generation, among which 99.7% are novel relative to the training dataset. With the introduction of a reward network, a significant portion (82.8%) of the generated circuits satisfy performance specifications, a substantial improvement over those without. Transistor-level experimental results further demonstrate the practicality and competitiveness of our generated circuits, with a nearly 3x improvement over manual designs.
14:30 CEST TS26.7 NANOELECTROMECHANICAL BINARY COMPARATOR FOR EDGE-COMPUTING APPLICATIONS
Speaker:
Victor Marot, University of Bristol, GB
Authors:
Victor Marot, Manu Krishnan, Mukesh Kulsreshath, Elliott Worsey, Roshan Weerasekera and Dinesh Pamunuwa, University of Bristol, GB
Abstract
Bitwise comparison is a fundamental operation in many digital arithmetic functions and is ubiquitous in both datapath and control elements; for example, many machine learning algorithms depend on binary comparison. This work proposes a new class of binary comparator circuits using 4-terminal nanoelectromechanical (NEM) relays that require just 6 devices compared to 9 transistors in CMOS implementations. Moreover, NEM implementations are capable of withstanding much higher temperatures, up to 300°C, and radiation levels, well over 1 Mrad absorbed dose, conditions which are common across many industrial edge applications, with near zero standby power. 1-bit magnitude and equality comparators, each comprising two in-plane silicon 4-terminal relays, were fabricated on a silicon-on-insulator substrate and electrically characterized as a proof of concept, the first such demonstration. Using the 1-bit comparators as building blocks, a scalable tree-based topology is proposed to implement higher-order comparators, resulting in a ≈47% reduction in device count over a CMOS implementation for a 64-bit comparator. Circuit-level simulations of the comparator using accurate device models show that a single operation consumes at most 21 fJ, a 9-fold reduction over the best CMOS offering in an equivalent process node.
14:35 CEST TS26.8 CLOCK AND POWER SUPPLY-AWARE HIGH ACCURACY PHASE INTERPOLATOR LAYOUT SYNTHESIS
Speaker:
Hung-Ming Chen, National Yang Ming Chiao Tung University, TW
Authors:
Siou-Sian Lin1, Shih-Yu Chen1, Yu-Ping Huang1, Tzu-Chuan Lin1, Hung-Ming Chen2 and Wei-Zen Chen1
1NYCU, TW; 2National Yang Ming Chiao Tung University, TW
Abstract
Motivated by requests from designers of clock and data recovery (CDR) circuits, who face inefficiency in generating high-accuracy phase interpolators (PIs), we have developed a layout generator for such circuits in this work, different from conventional constraint-driven works. In the first stage, we propose customized template floorplanning plus pin generation as demanded by the users. In the second stage, in order to generate a high-accuracy layout, we implement a gridless router for signal, power supply and clock. Experiments with several configurations indicate that our approach can generate high-quality layouts that align with user expectations, and even surpass the quality of manual designs on structurally regular high-performance PIs, which are not easy or efficient to generate with prior primitive/grid-based methods.
14:40 CEST TS26.9 ML-BASED FAST AND ACCURATE PERFORMANCE MODELING AND PREDICTION FOR HIGH-SPEED MEMORY INTERFACES ACROSS DIFFERENT TECHNOLOGIES
Speaker:
Taehoon Kim, Seoul National University, KR
Authors:
Taehoon Kim1, Minjeong Kim1, Hankyu Chi2, Byungjun Kang2, Eunji Song2 and Woo-Seok Choi1
1Seoul National University, KR; 2SK hynix, KR
Abstract
The chip industry is undergoing a market transition from mass production to mass customization. Rapid market changes require agile responses and diversified product designs, particularly in interface circuits managing chip-to-chip communication. To facilitate these shifts, this paper proposes a machine learning-based method for rapidly and accurately predicting and analyzing the performance of high-speed transceivers, along with an evaluation methodology utilizing the proposed approach. Notably, by using process technology information as an input in the dataset, this is the first work to predict the performance of a design across different technologies, which will be invaluable in architecting and optimizing designs during the early stages of development. By simulating each functional block, we gather a dataset for parameterized design and performance and incorporate device characteristics from lookup tables. The transmitter, which operates like digital circuits, is trained using parameterized signals with a DNN, while the receiver, containing analog blocks and feedback structures, employs hybrid LSTM-DNN learning with time-series input and output. Our model, trained with a 40nm design, demonstrates high accuracy in predicting performance even with different foundries and technologies. The majority of performance parameters show an R^2 value exceeding 0.9, indicating strong predictive accuracy under varying conditions. This method provides valuable insights for early-stage design optimization and process technology scaling, offering potential for broader applications in other circuit design areas.
14:45 CEST TS26.10 ACCELERATING OTA CIRCUIT DESIGN: TRANSISTOR SIZING BASED ON A TRANSFORMER MODEL AND PRECOMPUTED LOOKUP TABLES
Speaker:
Subhadip Ghosh, University of Minnesota, US
Authors:
Subhadip Ghosh1, Endalk Gebru1, Chandramouli Kashyap2, Ramesh Harjani1 and Sachin S. Sapatnekar1
1University of Minnesota, US; 2Cadence Design Systems, US
Abstract
Device sizing is crucial for meeting performance specifications in operational transconductance amplifiers (OTAs), and this work proposes an automated sizing framework based on a transformer model. The approach first leverages the driving-point signal flow graph (DP-SFG) to map an OTA circuit and its specifications into transformer-friendly sequential data. A specialized tokenization approach is applied to the sequential data to expedite the training of the transformer on a diverse range of OTA topologies, under multiple specifications. Under specific performance constraints, the trained transformer model is used to accurately predict DP-SFG parameters in the inference phase. The predicted DP-SFG parameters are then translated to transistor sizes using a precomputed look-up table-based approach inspired by the gm/Id methodology. In contrast to previous conventional or machine-learning-based methods, the proposed framework achieves significant improvements in both speed and computational efficiency by reducing the need for expensive SPICE simulations within the optimization loop; instead, almost all SPICE simulations are confined to the one-time training phase. The method is validated on a variety of unseen specifications, and the sizing solution demonstrates over 90% success in meeting specifications with just one SPICE simulation for validation, and 100% success with 3-5 additional SPICE simulations.
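The final lookup-table step can be pictured with a classic gm/Id calculation: pick an inversion level, derive the bias current from the target gm, and read the current density off a precharacterized curve. The table values and the size_from_gm helper below are invented for illustration; real tables come from SPICE characterization across lengths and bias voltages, and this is not the paper's sizing flow.

```python
import numpy as np

# Hypothetical precharacterized table for one device type at fixed L and VDS:
# gm/Id [1/V] versus current density Id/W [A/um]. Numbers are purely illustrative.
GM_ID_GRID = np.array([ 5.0, 10.0, 15.0, 20.0, 25.0])
ID_PER_W   = np.array([8e-6, 3e-6, 1e-6, 3e-7, 8e-8])

def size_from_gm(gm_target, gm_over_id):
    """
    gm/Id sizing step: choose an inversion level (gm/Id), obtain the bias current from
    Id = gm / (gm/Id), look up the current density at that inversion level, and derive
    the width W = Id / (Id/W).
    """
    id_bias = gm_target / gm_over_id                       # required drain current [A]
    density = np.interp(gm_over_id, GM_ID_GRID, ID_PER_W)  # Id/W at this gm/Id [A/um]
    width = id_bias / density                              # [um]
    return width, id_bias

W, Id = size_from_gm(gm_target=2e-3, gm_over_id=15.0)      # 2 mS at gm/Id = 15 1/V
print(f"W = {W:.1f} um, Id = {Id * 1e3:.2f} mA")
```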
14:50 CEST TS26.11 A 10PS-ORDER FLEXIBLE RESOLUTION TIME-TO-DIGITAL CONVERTER WITH LINEARITY CALIBRATION AND LEGACY FPGA
Speaker:
Kentaroh Katoh, Fukuoka University, JP
Authors:
Kentaroh Katoh1, Toru Nakura2 and Haruo Kobayashi3
1Fukuoka University, JP; 2Fukuoka University, JP; 3Gunma University, JP
Abstract
This paper presents a 10 ps-order, flexible-resolution time-to-digital converter (TDC) consisting only of lookup tables and flip-flops, so it can be applied to legacy FPGAs, which makes it industry friendly. The proposed TDC is a Vernier delay-line based TDC. By using MUX chains as delay-adjustable buffers, it realizes a flexible, high-resolution, 10 ps-order TDC. By controlling the control values of each MUX chain independently, the nonlinearity of the TDC is compensated. In the evaluation using the AMD Artix-7 FPGA, the DNL and INL were [-0.26 LSB, 0.91 LSB] and [-0.84 LSB, 2.27 LSB], respectively, at a resolution of 8.92 ps.
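A behavioral view of a Vernier delay-line TDC helps explain where such a resolution figure comes from: the resolution is the difference between the per-stage delays of the slow (start) and fast (stop) chains. The delay values and stage count below are illustrative; only the resulting 8.92 ps resolution is taken from the abstract, and this is not a model of the FPGA implementation.

```python
def vernier_tdc_code(time_interval, t_slow=40e-12, t_fast=31.08e-12, stages=64):
    """
    Behavioral model of a Vernier delay-line TDC. The start edge travels through
    buffers of delay t_slow, the stop edge through buffers of delay t_fast; the stop
    edge gains (t_slow - t_fast) per stage, so the stage index at which it catches
    up encodes the input interval with resolution (t_slow - t_fast).
    """
    resolution = t_slow - t_fast
    for stage in range(stages):
        if stage * resolution >= time_interval:
            return stage
    return stages  # saturated: the interval exceeds the measurable range

# With these assumed delays the resolution is 40 ps - 31.08 ps = 8.92 ps.
print(vernier_tdc_code(50e-12))  # ~ceil(50 ps / 8.92 ps) = 6
```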

TS27 Design for On-Chip Interconnects

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS27.1 HIPERNOC: A HIGH-PERFORMANCE NETWORK-ON-CHIP FOR FLEXIBLE AND SCALABLE FPGA-BASED SMARTNICS
Speaker:
Klajd Zyla, TU Munich, DE
Authors:
Klajd Zyla, Marco Liess, Thomas Wild and Andreas Herkersdorf, TU Munich, DE
Abstract
A recent approach that the research community has proposed to address the steep growth of network traffic and the attendant rise in computing demands is in-network computing. This paradigm shift is bringing about an increase in the types of computations performed by network devices. Consequently, processing demands are becoming more varied, requiring flexible packet-processing architectures. State-of-the-art switch-based smart network interface cards (SmartNICs) provide high versatility without sacrificing performance but do not scale well concerning resource usage. In this paper, we introduce HiPerNoC—a flexible and scalable field-programmable gate array (FPGA)-based SmartNIC architecture deploying a 2D-mesh network-on-chip (NoC) with a novel router design to manage network traffic with diverse processing demands. The NoC can forward incoming network packets to the available processing engines in the required sequence at a traffic load of up to 91.1 Gbit/s (0.89 flit/node/cycle). Each router applies distributed switch allocation and avoids head-of-line blocking by deploying queues at the switch crosspoints of input-output connections used by the routing algorithm. It also prevents deadlocks by employing non-blocking virtual cut-through switching. We implemented a prototype of HiPerNoC as a 4x4 2D-mesh NoC in SystemVerilog and evaluated it with synthetic network traffic via cycle-accurate register-transfer level simulations in Vivado. The evaluation results show that HiPerNoC achieves up to 53% higher saturation throughput, occupies 53 % fewer lookup tables and block RAMs, and consumes 16 % less power on an Alveo U55C than ProNoC—a state-of-the-art FPGA-based NoC.
14:05 CEST TS27.2 NEUROHEXA: A 2D/3D-SCALABLE MODEL-ADAPTIVE NOC ARCHITECTURE FOR NEUROMORPHIC COMPUTING
Speaker:
Yi Zhong, Peking University, CN
Authors:
Yi Zhong, Zilin Wang, Yipeng Gao, Xiaoxin Cui, Xing Zhang and Yuan Wang, Peking University, CN
Abstract
Neuromorphic computing has emerged as a novel computing paradigm that entails a bio-inspired architecture to reproduce the remarkable functionalities of the human brain, such as massively parallel processing and extremely low power consumption. However, those promising merits can be largely negated by a mismatched communication infrastructure in large-scale hardware implementations, in view of the vast degree of neural connectivity, the unstructured spike dataflow, and the unbalanced model workload assignment. In an effort to tackle those challenges, this work presents NeuroHexa, a network-on-chip (NoC) architecture intended for multi-core neuromorphic design. NeuroHexa adopts a customized intra-chip hexagonal topology, which can be further cascaded in 6 directions by either 2D or 3D chiplet integration. Designed with a globally asynchronous, locally synchronous (GALS) methodology, groups of processing nodes can operate at independent paces to further improve resource utilization. To satisfy the varied requirements of data reuse across the chip, NeuroHexa proposes a flexible multicast routing mechanism that adapts to the model-defined dataflow, and under congestion it can switch its routing algorithm between deterministic and fully adaptive routing modes. The presented NoC router is evaluated in 28nm CMOS, achieving a maximum throughput of 179.2 Gbps and a best energy efficiency of 4.872 pJ/packet with an area overhead of 0.0226 mm².
14:10 CEST TS27.3 SPB: TOWARDS LOW-LATENCY CXL MEMORY VIA SPECULATIVE PROTOCOL BYPASSING
Speaker:
Junbum Park, Sungkyunkwan University, KR
Authors:
Junbum Park, Yongho Lee, Sungbin Jang, Wonyoung Lee and Seokin Hong, Sungkyunkwan University, KR
Abstract
Compute Express Link (CXL) is an advanced interconnect standard designed to facilitate high-speed communication between CPUs, accelerators, and memory devices, making it well-suited for data-intensive applications such as machine learning and real-time analytics. Despite its advantages, CXL memory encounters significant latency challenges due to the complex hierarchy of protocol layers, which can adversely impact performance in latency-sensitive scenarios. To address this issue, we introduce the Speculative Protocol Bypassing (SPB) architecture, which aims to minimize latency during read operations by speculatively bypassing several protocol layers of CXL. To achieve this, SPB employs the Snooper mechanism, which extracts essential read commands from the Flit data at an early stage, allowing it to bypass multiple protocol layers and reduce memory access time. Additionally, the Hazard Filter (HF) prevents Read-After-Write (RAW) hazards between read and write operations, thereby maintaining data integrity and ensuring system reliability. The SPB architecture effectively optimizes CXL memory access latency, providing a robust solution for high-performance computing environments that require both low latency and high efficiency. Its minimal hardware overhead makes it a practical and scalable enhancement for future CXL-based memory.
14:15 CEST TS27.4 SRING: A SUB-RING CONSTRUCTION METHOD FOR APPLICATION-SPECIFIC WAVELENGTH-ROUTED OPTICAL NOCS
Speaker:
Zhidan Zheng, TU Munich, DE
Authors:
Zhidan Zheng, Meng Lian, Mengchu Li, Tsun-Ming Tseng and Ulf Schlichtmann, TU Munich, DE
Abstract
Wavelength-routed optical networks-on-chip (WRONoCs) attract ever-increasing attention for supporting high-speed communications with low power and latency. Among all WRONoC routers, optical ring routers attract much interest for their simple structures. However, current designs of ring routers have overlooked the customization problem: when adapting to applications that have specific communication requirements, current designs suffer high propagation loss caused by long worst-case signal paths and high splitter usage in power distribution networks (PDN). To address those problems, we propose a novel customization method to generate application-specific ring routers with multiple sub-rings, SRing. Instead of sequentially connecting all nodes in a large ring, we cluster the nodes and connect them with sub-ring waveguides to reduce the path length. Besides, we propose a mixed integer linear programming model for wavelength assignment to reduce the number of PDN splitters. We compare SRing to three state-of-the-art ring router design methods for six applications. Experimental results show that SRing can greatly reduce the length of the longest signal path, the worst-case insertion loss, and the number of splitters in the PDN, significantly improving the power efficiency.
14:20 CEST TS27.5 BEAM: A MULTI-CHANNEL OPTICAL INTERCONNECT FOR MULTI-GPU SYSTEMS
Speaker:
Chongyi Yang, Microelectronics Thrust, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Chongyi Yang1, Bohan Hu1, Peiyu Chen1, Yinyi Liu2, Wei Zhang2 and Jiang Xu1
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2The Hong Kong University of Science and Technology, HK
Abstract
High-performance computing and AI applications necessitate high-bandwidth communication between GPUs. Traditional electrical interconnects for GPU-to-GPU communication face challenges over longer distances, including high power consumption, crosstalk noise, and signal loss. In contrast, optical interconnects excel in this domain, offering high bandwidth and consistent power dissipation over long distances. This paper proposes BEAM, a Bandwidth-Enhanced optical interconnect Architecture for Multi-GPU systems. BEAM extends electrical-optical interfaces into the GPU package, positioning them close to GPU compute logic and memory. Unlike existing single-channel approaches, each BEAM optical interface incorporates multiple parallel optical channels, further enhancing bandwidth. An arbitration scheme manages channel usage among data transfers. Evaluation on Rodinia benchmarks and LLM training kernels demonstrates that BEAM achieves a speedup of 1.14 - 1.9× and reduces energy consumption by 29 - 44% compared to the electrical-interconnect system and state-of-the-art schemes, while maintaining comparable chip area consumption.
14:25 CEST TS27.6 TCDM BURST ACCESS: BREAKING THE BANDWIDTH BARRIER IN SHARED-L1 RVV CLUSTERS BEYOND 1000 FPUS
Speaker:
Diyou Shen, ETH Zurich, CH
Authors:
Diyou Shen1, Yichao Zhang1, Marco Bertuletti1 and Luca Benini2
1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT
Abstract
As computing demand and memory footprint of deep learning applications accelerate, clusters of cores sharing local (L1) multi-banked memory are widely used as key building blocks in large-scale architectures. When the cluster's core count increases, a flat all-to-all interconnect between cores and L1 memory banks becomes a physical implementation bottleneck, and hierarchical network topologies are required. However, hierarchical, multi-level intra-cluster networks are subject to internal contention, which may lead to significant performance degradation, especially for SIMD or vector cores, as their memory access is bursty. We present the TCDM Burst Access architecture, a software-transparent burst transaction support to improve bandwidth utilization in clusters with many vector cores tightly coupled to a multi-banked L1 data memory. In our solution, a Burst Manager dispatches burst requests to L1 memory banks, and multiple 32b words from burst responses are retired in parallel on channels with parametric data-width. We validate our design on a RISC-V Vector (RVV) many-core cluster, evaluating the benefits on different core counts. With minimal logic area overhead (less than 8%), we improve the bandwidth of 16-, 256-, and 1024-Floating Point Unit (FPU) baseline clusters without Tightly Coupled Data Memory (TCDM) Burst Access by 118%, 226%, and 77%, respectively. Reaching up to 80% of the cores-memory peak bandwidth, our design demonstrates ultra-high bandwidth utilization and enables efficient performance scaling. Implemented in a 12-nm FinFET technology node, compared to the serialized access baseline, our solution achieves up to 1.9x energy efficiency and 2.76x performance in real-world kernel benchmarks.
14:30 CEST TS27.7 SEDG: STITCH-COMPATIBLE END-TO-END LAYOUT DECOMPOSITION BASED ON GRAPH NEURAL NETWORK
Speaker:
Yifan Guo, Shanghai Jiao Tong university, CN
Authors:
Yifan Guo1, Jiawei Chen1, Yexin Li1, Yunxiang Zhang1, Qing Zhang1, Yuhang Zhang2 and Yongfu Li1
1Shanghai Jiao Tong University, CN; 2East China Normal University, CN
Abstract
Advanced semiconductor lithography faces significant challenges as feature sizes continue to shrink, necessitating effective Multiple Patterning Layout Decomposition (MPLD) algorithms. Existing MPLD algorithms have limited efficiency or cannot support stitch insertion to achieve finer-grained optimal decomposition. This paper introduces an end-to-end GNN-based framework that not only achieves high-quality solutions quickly but also applies to layouts with stitches. Our framework treats layouts as heterogeneous graphs and performs inference through a message-passing mechanism. We deliver ultra-competitive, near-optimal solutions that are 10× faster than the exact algorithm (e.g., integer linear programming) and 3× faster than approximate algorithms (e.g., exact-cover, semi-definite programming).
14:35 CEST TS27.8 MULTISCALE FEATURE ATTENTION AND TRANSFORMER BASED CONGESTION PREDICTION FOR ROUTABILITY-DRIVEN FPGA MACRO PLACEMENT
Speaker:
Hao Gu, Southeast University, CN
Authors:
Hao Gu1, Xinglin Zheng1, Youwen Wang1, Keyu Peng1, Ziran Zhu2 and Yang Jun3
1Southeast University, CN; 2School of Integrated Circuits, Southeast University, CN; 3,
Abstract
As routability has emerged as a critical task in modern field-programmable gate array (FPGA) physical design, it is desirable to develop an effective congestion prediction model during the placement stage. Given that the interconnection congestion level is a critical metric for measuring the routability of FPGA placement, we utilize that level as the model training label. In this paper, we propose a multiscale feature attention (MFA) and transformer based congestion prediction model to extract placement features and strengthen their association with congested areas for effective FPGA macro placement. A convolutional neural network (CNN) component is first designed to extract multiscale features from grid-based placement. Then, a well-designed MFA block is proposed that utilizes the dual attention mechanism on both spatial and channel dimensions to enhance the representation of each multiscale feature. By incorporating MFA blocks and CNN's output at each skip connection layer, our model substantially enhances its capability to learn features and recover more precise congestion level maps. Furthermore, multiple transformer layers that employ dynamic attention mechanisms are utilized to extract global information, which can significantly improve the difference between various congestion levels and enhance the ability to identify these levels. Based on the ten most congested and challenging benchmarks from the MLCAD 2023 FPGA macro placement contest, experimental results show that our model outperforms existing congestion prediction models. Furthermore, our model can achieve the best routability and score among the contest winners when integrated into the macro placer based on DREAMPlaceFPGA.
14:40 CEST TS27.9 AN EFFECTIVE AND EFFICIENT CROSS-LINK INSERTION FOR NON-TREE CLOCK NETWORK SYNTHESIS
Speaker:
Mengshi Gong, Southwest University of Science and Technology, CN
Authors:
Jinghao Ding1, Jiazhi Wen1, Hao Tang1, Zhaoqi Fu1, Mengshi Gong1, Yuanrui Qi1, Wenxin Yu1 and Jinjia Zhou2
1Southwest University of Science and Technology, CN; 2Hosei University, JP
Abstract
Clock skew introduces a significant challenge to overall system performance. Existing non-tree solutions like cross-link insertion often come with limitations, such as excessive resource and power consumption. In this work, we propose a cross-link insertion algorithm that effectively reduces clock skew with minimal power overhead and prioritizes delay optimization on the paths with high sensitivity to skew. The experimental results from the ISPD 2010 benchmarks show a 17% reduction in the mean of clock skew, a 45% decrease in the standard deviation of clock skew, and 13% lower power consumption versus the advanced non-tree solutions in the literature.

US01 Unplugged session

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST


W03 3rd Workshop on Nano Security: From Nano-Electronics to Secure Systems

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 18:00 CEST


LBR02 Late Breaking Results

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST


MPP03 Multi-Partner Projects

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST MPP03.1 MULTI-PARTNER PROJECT: ORCHESTRATING DEPLOYMENT AND REAL-TIME MONITORING - NEPHELE MULTI-CLOUD ECOSYSTEM
Speaker:
Manolis Katsaragakis, National TU Athens, GR
Authors:
Manolis Katsaragakis1, Orfeas Filippopoulos1, Christos Sad2, Dimosthenis Masouros1, Dimitrios Spatharakis1, Ioannis Dimolitsas1, Nikos Filinis1, Anastasios Zafeiropoulos1, Kostas Siozios3, Dimitrios Soudris1 and Symeon Papavassiliou1
1National TU Athens, GR; 2Department of Physics, Aristotle University of Thessaloniki, GR; 3Aristotle University of Thessaloniki, GR
Abstract
The rapid growth of Internet of Things (IoT) devices and emerging technologies, along with the growing demands of edge-deployed applications, has led to a complex paradigm where computation often shifts dynamically across the IoT-Edge-Cloud continuum. The NEPHELE project addresses these complexities by enabling seamless orchestration across a diverse spectrum of computing resources, spanning multi-cloud environments to the far-Edge. In this paper, we present NEPHELE's multi-cloud infrastructure, built to overcome key orchestration challenges within cloud and edge environments. We discuss the core components and architectural decisions, focusing on multi-cluster resource orchestration mechanisms, integrated monitoring for local and multi-cloud systems, inter- and intra-cluster scaling, and networking capabilities. Experimental results demonstrate the efficiency of our infrastructure, highlighting overhead management in service deployment, migration, networking, and scaling scenarios, thus demonstrating NEPHELE's robustness in handling distributed applications across heterogeneous environments.
16:35 CEST MPP03.2 MULTI-PARTNER PROJECT: CYBERSECDOME - FRAMEWORK FOR SECURE, COLLABORATIVE, AND PRIVACY-AWARE INCIDENT HANDLING FOR DIGITAL INFRASTRUCTURE
Speaker:
Mohammad Hamad, TU Munich, DE
Authors:
Mohammad Hamad1, Michael Kuehr1, Haralambos Mouratidis2, Eleni-Maria Kalogeraki2, Christos Gizelis3, Dimitris Papanikas3, Athanasios Bountioukos-Spinaris4, Charilaos Skandylas5, Evangelos Raptis6, Andreas Alexopoulos6, Grigorios Chrysos7, Mina Marmpena8, Sevasti Politi8, Konstantinos Lieros8, Papagiannopoulos Nikolaos9, Iordanis Xanthopoulos10, Spyros Papastergiou11, Sotiris Ioannidis7, Mikael Asplund12, Marc-Oliver Pahl13 and Sebastian Steinhorst1
1TU Munich, DE; 2Security Labs Consulting, IE; 3Hellenic Telecommunications Organisation, GR; 4CyberAlytics Ltd., CY; 5Linköping University, SE; 6Aegis IT Research, DE; 7TU Crete, GR; 8Information Technology for Market Leadership, GR; 9Athens International Airport S.A., GR; 10Sphynx, SZ; 11MAGGIOLI S.P.A., IT; 12Linköping University, SE; 13IMT Atlantique, FR
Abstract
Digital infrastructure is vital for the economy, democracy, and everyday life, yet it is becoming increasingly vulnerable to strategic cyber-attacks. These attacks can lead to significant digital disruptions, resulting in widespread service outages, financial losses, and a decline in public trust. Ensuring resilience is difficult due to the infrastructure's complexity, the large volume of data involved, and the growing need for quick, coordinated responses. In the EU Horizon project CyberSecDome, we propose a multi-layered framework that provides AI-driven solutions for incident detection and prediction, automated testing, risk assessment, and rapid incident response, supporting continuity amid complex, large-scale cyber threats. Additionally, CyberSecDome introduces a virtual reality interface to enhance AI model explainability and provide real-time contextual awareness of ongoing attacks and defense mechanisms. It also enables privacy-aware model sharing across AI systems, fostering secure collaboration among different systems.
16:40 CEST MPP03.3 MULTI-PARTNER PROJECT: ARCHITECTURES AND DESIGN METHODOLOGIES TO ACCELERATE AI WORKLOADS - THE ICSC FLAGSHIP 2 PROJECT
Speaker:
Cristina Silvano, Politecnico di Milano, IT
Authors:
Cristina Silvano1, Fabrizio Ferrandi1, Serena Curzel1, Daniele Ielmini2, Stefania Perri3, Fanny Spagnolo4, Pasquale Corsonello5, Sebastiano Schifano6, Cristian Zambelli7, Angelo Garofalo8, Luca Benini9 and Francesco Conti10
1Politecnico di Milano, IT; 2Politecnico di Milano, IT; 3University of Calabria - DIMEG, IT; 4DIMES, University of Calabria, IT; 5University of Calabria, IT; 6University of Ferrara, IT; 7University of Ferrara, IT; 8University of Bologna, ETH Zurich, IT; 9ETH Zurich, CH | Università di Bologna, IT; 10Università di Bologna, IT
Abstract
Recent pre-exascale and exascale supercomputers have driven the development of increasingly sophisticated AI models for diverse applications, including image recognition and classification, natural language processing, and generative AI. These applications require specialized hardware accelerators, to handle the heavy computational demands of AI algorithms in an energy-efficient manner. Today, AI accelerators are deployed across various systems, from low-power edge devices to large-scale servers, high-performance computing (HPC) infrastructures, and data centers. The primary objective of the ICSC Flagship 2 project, discussed in this paper, is to develop heterogeneous hardware platforms optimized to accelerate HPC and big data applications. Specifically, this paper provides an overview of the key challenges addressed and the achievements realized at the current intermediate stage of the ICSC Flagship 2 project focused on architectures, technologies, and design methodologies to design efficient hardware accelerators for AI workloads, such as deep learning (DL) and transformer models.
16:45 CEST MPP03.4 MULTI-PARTNER PROJECT: A DATA SPACES ARCHITECTURE FOR ENHANCING GREEN AI SERVICES (GREEN.DAT.AI)
Speaker:
Ioannis Chrysakis, Netcompany-Intrasoft SA, LU
Authors:
Ioannis Chrysakis1, Evangelos Agorogiannis1, Nikoleta Tsampanaki1, Michalis Vourtzoumis1, Eva Chondrodima2, Yannis Theodoridis2, Domen Mongus3, Ben Capper4, Martin Wagner5, Aris Sotiropoulos6, Fábio Coelho7, Claudia Brito8, Panos Protopapas9, Despina Brasinika9, Ioanna Fergadiotou9 and Christos Doulkeridis2
1Netcompany-Intrasoft SA, LU; 2University of Piraeus, GR; 3University of Maribor, SI; 4Red Hat, IE; 5Eviden, ES; 6AEGIS IT Research, DE; 7INESC TEC & Universidade do Minho, PT; 8INESC TEC, PT; 9Inlecom Innovation, GR
Abstract
The concept of data spaces has emerged as a structured, scalable solution to streamline and harmonize data sharing across established ecosystems. Simultaneously, the rise of AI services enhances the extraction of predictive insights, operational efficiency, and decision-making. Despite the potential of combining these two advancements, integration remains challenging: data spaces technology is still developing, and AI services require further refinement in areas like ML workflow orchestration and energy-efficient ML algorithms. In this paper, we introduce an integrated architectural framework, developed under the Green.Dat.AI project, that unifies the strengths of data spaces and AI to enable efficient, collaborative data sharing across sectors. A practical application is illustrated through a smart farming use case, showcasing how AI services within a data space can advance sustainable agricultural innovation. Integrating data spaces with AI services thus maximizes the value of decentralized data while enhancing efficiency through a powerful combination of data and AI capabilities.
16:50 CEST MPP03.5 MULTI-PARTNER PROJECT: SAFE, SECURE AND DEPENDABLE MULTI-UAV SYSTEMS FOR SEARCH AND RESCUE OPERATIONS
Speaker:
Maria Michael, University of Cyprus, CY
Authors:
Panagiota Nikolaou1, Antonis Savva2, Ioannis Sorokos3, Koorosh Aslansefat4, Sondess Missaoui5, Akram Naveed3, Daniel Hillen3, Marc Lorenz3, Martin Walker4, Manos Papoutsakis6, Simos Gerasimou5, Panayiotis Kolios7, Yiannis Papadopoulos4, Jan Reich3, Sotiris Ioannidis6 and Maria Michael8
1University of Central Lancashire, CY; 2KIOS Research and Innovation Center of Excellence and the Department of Electrical and Computer Engineering, CY; 3Fraunhofer IESE, DE; 4University of Hull, GB; 5University of York, GB; 6Institute of Computer Science, Foundation for Research and Technology, Heraklion, GR; 7Dept. of Computer Science, KIOS Centre of Excellence, University of Cyprus, CY; 8KIOS Research and Innovation Center of Excellence and the Department of Electrical and Computer Engineering, CY
Abstract
Unmanned Aerial Vehicles (UAVs) have become essential in search and rescue operations, especially in disaster management scenarios. Their effective navigation and the integration of a plethora of sensors assist in efficient person detection, making them an essential technological tool for first responders. Multi-UAV systems extend these benefits by using coordinated strategies to cover large areas efficiently, reducing overall mission response time and enhancing its success. Despite these advantages, challenges remain in ensuring the safety, security, and dependability of (multi-)UAV missions. Navigation risks, potential cyber threats, and hardware-/software-related reliability issues can impact mission results. Additionally, UAVs are highly constrained devices with limited battery capacity, requiring the use of lightweight technologies. In this paper, we present part of the results of the SESAME project, an EU multi-partner project that aims to develop safe and secure multi-robot systems. In particular, we present some of the developed SESAME Executable Digital Dependability Identities (EDDI) technologies based on Markov models, statistical distance measures, and other advanced approaches for enhancing safety, security and dependability of the UAV platform and underlying models. These EDDI technologies are seamlessly integrated using the ConSerts framework in a multi-UAV platform and tested using search and rescue scenarios. The results demonstrate significant improvements in multi-UAV safety, with an availability rate of 91% and a search and rescue algorithmic accuracy of 99.8%. Additionally, the system achieves precise detection of spoofing attacks, using collaborative localization as a mitigation technique to guide the UAV to a safe landing, even in the absence of GPS signals.
16:55 CEST MPP03.6 MULTI-PARTNER PROJECT: KEY ENABLING TECHNOLOGIES FOR COGNITIVE COMPUTING CONTINUUM - MYRTUS PROJECT PERSPECTIVE
Speaker:
Francesca Palumbo, Università degli Studi di Cagliari, IT
Authors:
Francesca Palumbo1, Francesco Ratto2, Claudio Rubattu3, Maria Katiuscia Zedda4, Tiziana Fanni4, Veena Rao5, Bart Driessen6 and Jeronimo Castrillon7
1University of Cagliari, IT; 2Università degli Studi di Cagliari, IT; 3University of Sassari, IT; 4Abinsula Srl, IT; 5HIRO microdatacenters, NL; 6TNO, NL; 7TU Dresden, DE
Abstract
The MYRTUS Horizon Europe project embraces the principles of the EU CloudEdgeIoT Initiative, integrating edge, fog, and cloud in a continuum of computing resources. MYRTUS intends to deliver abstractions, cognitive orchestration mechanisms, and a whole design environment to build and operate collaborative, distributed, heterogeneous systems. The goal is to provide high performance and play a crucial role in enabling energy efficiency and trustworthiness in today's systems.
17:00 CEST MPP03.7 MULTI-PARTNER PROJECT: BIM-POWERED ENVIRONMENTAL DATA AGENT FOR MORE RESILIENT AND TRUSTWORTHY DATA CENTERS
Speaker:
Oğuzhan Herkiloğlu, BİTNET Bilişim Hizmetleri, Ltd., TR
Authors:
Oğuzhan Herkiloğlu1, Ali Atalay2, İbrahim Arif3, Salih Ergün4 and Alper Kanak5
1Bitnet Bilişim Hizmetleri Ltd. Sti., TR; 2AI4SEC OÖ, EE; 3Ergünler R&D Ltd. Co., TR; 4Ergtech SP.Z.O.O, PL; 5Ergünler R&D Co. Ltd., TR
Abstract
This paper introduces an agent-based approach that semantically integrates the Building Information Model (BIM), Geographical Information System (GIS), and the Environmental Data Agent (EDA)-based optimization interface between Information and Operational Technology (IT/OT) for more trusted and resilient data centers. Using the cybersecurity-aware BIM-GIS-IoT data model facilitates the exchange of requirements and forecasts to optimize energy use, environmental impact, availability, and costs in data centers. At the core of this solution, the EDA securely mediates data exchange between IT and OT, translating IT resource consumption into energy metrics for effective optimization.
17:01 CEST MPP03.8 MULTI-PARTNER PROJECT: RESILIENT TSN NETWORKS (RESTSN)
Speaker:
Rafik Henia, cortAIx/Labs, Thales, FR
Authors:
Rafik Henia1 and Marc Boyer2
1Thales Research & Technology, FR; 2ONERA, FR
Abstract
TSN (Time-Sensitive Networking), an extension of Ethernet technology standardized by IEEE, appears to be a promising technology to unify traditional networks in modern vehicle systems. It offers several advantages, including a much higher bandwidth and enhanced determinism, efficiency, scalability, and interoperability. One notable advantage of TSN over traditional networks lies in its capability to support data traffic of different shapes (e.g., synchronous control/command, asynchronous video…) within the same network infrastructure. This means that TSN can accommodate a diverse range of data traffic with varying levels of importance or urgency, e.g., in aircraft from critical flight control systems to less time-sensitive passenger entertainment systems, all within a single network framework. Its deployment in vehicles would therefore substantially improve communication systems and reduce operational costs. Transmitting critical and non-critical data over the same physical TSN network infrastructure will require maintaining reliable communication to ensure essential functionalities, such as braking systems or flight control. However, in the event of a network failure (due to hardware breakdown, software malfunction, or even a cyberattack), some of the communications are inevitably lost, potentially compromising the system's overall integrity. This type of issue can have severe consequences on the overall system reliability. The ResTSN project's objective is to enhance the resilience of TSN by allowing the network, in case of failure, to automatically and dynamically reconfigure itself, ensuring that its most critical functions continue to operate, even with reduced resources. Reconfiguring the TSN network allows isolating the malfunctioning components, thus preventing potential cascading failures. Additionally, reconfiguration ensures that the TSN network can be restored once the issue is resolved.

TS28 Test and Verification for Dependability

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS28.1 POLYNOMIAL FORMAL VERIFICATION OF SEQUENTIAL CIRCUITS USING WEIGHTED-AIGS
Speaker:
Mohamed Nadeem, University of Bremen, DE
Authors:
Mohamed Nadeem1, Chandan Jha1 and Rolf Drechsler2
1University of Bremen, DE; 2University of Bremen | DFKI, DE
Abstract
Ensuring the functional correctness of a digital system is achievable through formal verification. Despite the increased complexity of modern systems, formal verification still needs to be done in a reasonable time. Hence, Polynomial Formal Verification (PFV) techniques are being explored as they provide a guaranteed upper bound on the run time for verification. Recently, it was shown that combinational circuits characterized by a constant cutwidth can be verified in linear time using Answer Set Programming (ASP). However, most of the designs used in digital systems are sequential. Hence, in this paper, we propose a linear-time formal verification approach using ASP for sequential circuits with constant cutwidth. We achieve this by proposing a new data structure called Weighted-And Inverter Graph (W-AIG). Unlike existing formal verification methods, we prove that our approach can verify any sequential circuit with a constant cutwidth in linear time. Finally, we also implement our approach and experimentally show the results on a variety of sequential circuits like pipelined adders, serial adders, and shift registers to confirm our theoretical findings.
16:35 CEST TS28.2 WORD-LEVEL COUNTEREXAMPLE REDUCTION METHODS FOR HARDWARE VERIFICATION
Speaker:
Zhiyuan Yan, Microelectronics Thrust, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Zhiyuan Yan1 and Hongce Zhang2
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Hardware verification is crucial to ensure the correctness of the logic design of digital circuits. The purpose of verification is to either find bugs or show their absence. Prior works mostly focus on the bug-finding process and have proposed a range of verification algorithms and techniques to reach a bug faster or to conclude with a proof of correctness. However, for a human verification engineer, it also matters how to better analyze the counterexample trace to understand the root cause of bugs. This kind of technique remains absent in word-level circuit analysis. In this paper, we investigate the counterexample reduction method. Given the existing techniques for the bit-level circuit model, we first extend current semantic analysis methods to word-level counterexample reduction and then develop a more efficient word-level structural analysis approach. We compare the effectiveness and overhead of these methods on hardware model-checking problems and show the usefulness of such analysis in applications including pivot input analysis, word-level model-checking and counterexample-guided abstraction refinement.
16:40 CEST TS28.3 ACCURATE AND EXTENSIBLE SYMBOLIC EXECUTION OF BINARY CODE BASED ON FORMAL ISA SEMANTICS
Speaker:
Sören Tempel, TU Braunschweig, DE
Authors:
Sören Tempel1, Tobias Brandt2, Christoph Lüth3, Christian Dietrich1 and Rolf Drechsler3
1TU Braunschweig, DE; 2Independent, DE; 3University of Bremen | DFKI, DE
Abstract
Symbolic execution is an SMT-based software verification and testing technique. Symbolic execution requires tracking performed computations during software simulation to reason about branches in the software under test. The prevailing approach to symbolic execution of binary code tracks computations by transforming the code under test into an architecture-independent IR and then symbolically executing this IR. However, the resulting IR must be semantically equivalent to the binary code, making this process complex and error-prone. The semantics of the binary code are specified by the targeted ISA, commonly given in natural language and requiring a manual implementation of the transformation to an IR. In recent years, the use of formal languages to describe ISA semantics in a machine-readable way has gained increased popularity. We investigate the utilization of such formal semantics for symbolic execution of binary code, achieving an accurate representation of instruction semantics. We present a prototype for the RISC-V ISA and conduct a case study to demonstrate that it can be easily extended to additional instructions. Furthermore, we perform an experimental comparison with prior work, which resulted in the discovery of five previously unknown bugs in the ISA implementation of the popular IR-based symbolic executor angr.
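The core idea of executing instruction semantics symbolically can be pictured with Z3 (the z3-solver Python package): instruction semantics become plain functions over a symbolic register file, and branch conditions become path constraints. The two toy "instructions", register names, and program below are invented for illustration and bear no relation to the paper's formal RISC-V model or to angr.

```python
from z3 import BitVec, BitVecVal, Solver, sat

def exec_addi(state, rd, rs, imm):
    """Toy ADDI semantics: rd = rs + imm on a symbolic 32-bit register file."""
    state = dict(state)
    state[rd] = state[rs] + BitVecVal(imm, 32)
    return state

def take_beq(solver, state, rs1, rs2):
    """Explore the taken path of a toy BEQ: add the branch condition as a path constraint."""
    cond = state[rs1] == state[rs2]
    solver.add(cond)
    return cond

# Fully symbolic initial register file.
state = {f"x{i}": BitVec(f"x{i}", 32) for i in range(4)}
s = Solver()

state = exec_addi(state, "x1", "x0", 5)    # x1 = x0 + 5
state = exec_addi(state, "x2", "x1", -5)   # x2 = x1 - 5
take_beq(s, state, "x2", "x0")             # taken path of beq x2, x0

print(s.check())   # sat: the taken path is feasible (here, for any initial x0)
print(s.model())
```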
16:45 CEST TS28.4 EFFICIENT SAT-BASED BOUNDED MODEL CHECKING OF EVOLVING SYSTEMS
Speaker:
Sophie Andrews, Stanford University, US
Authors:
Sophie Andrews, Matthew Sotoudeh and Clark Barrett, Stanford University, US
Abstract
SAT-based verification is a common technique used by industry practitioners to find bugs in computer systems. However, these systems are rarely designed in a single step: instead, designers repeatedly make small modifications, reverifying after each change. With current tools, this reverification step takes as long as a full, from-scratch verification, even if the design has only been modified slightly. We propose a novel SAT-based verification technique that performs significantly better than the naive approach in the setting of evolving systems. The key idea is to reuse information learned during the verification of earlier versions of the system to speed up the verification of later versions. We instantiate our technique in a bounded model checking tool for SystemVerilog code and apply it to a new benchmark set based on real edit history for a set of open source RISC-V cores. This new benchmark set is now publicly available for further research on verification of evolving systems. Our tool, PrediCore, significantly improves the time required to verify properties on later versions of the cores compared to the current state-of-the-art, verify-from-scratch approach.
16:50 CEST TS28.5 HIGH-THROUGHPUT SAT SAMPLING
Speaker:
Arash Ardakani, University of California, Berkeley, US
Authors:
Arash Ardakani1, Minwoo Kang1, Kevin He1, Qijing Huang2 and John Wawrzynek1
1University of California, Berkeley, US; 2NVIDIA Corp., US
Abstract
In this work, we present a novel technique for GPU-accelerated Boolean satisfiability (SAT) sampling. Unlike conventional sampling algorithms that directly operate on conjunctive normal form (CNF), our method transforms the logical constraints of SAT problems by factoring their CNF representations into simplified multi-level, multi-output Boolean functions. It then leverages gradient-based optimization to guide the search for a diverse set of valid solutions. Our method operates directly on the circuit structure of refactored SAT instances, reinterpreting the SAT problem as a supervised multi-output regression task. This differentiable technique enables independent bit-wise operations on each tensor element, allowing parallel execution of learning processes. As a result, we achieve GPU-accelerated sampling with significant runtime improvements ranging from 33.6x to 523.6x over state-of-the-art heuristic samplers. We demonstrate the superior performance of our sampling method through an extensive evaluation on 60 instances from a public domain benchmark suite utilized in previous studies.
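One way to picture gradient-guided SAT sampling is the smooth CNF relaxation below: variables become values in [0, 1], each clause gets a differentiable satisfaction score, and gradient descent from different random starts yields diverse candidate assignments. The tiny formula, the product-form relaxation, and the hyperparameters are illustrative assumptions; the paper's method instead optimizes over refactored multi-level circuits on a GPU.

```python
import numpy as np

# CNF as a list of clauses; each literal is (variable index, is_positive).
# Hypothetical instance: (x0 or ~x1) and (x1 or x2) and (~x0 or x2)
CNF = [[(0, True), (1, False)], [(1, True), (2, True)], [(0, False), (2, True)]]
NUM_VARS = 3

def relaxed_loss_and_grad(x, eps=1e-3):
    """Smooth relaxation: clause_sat = 1 - prod(1 - literal); loss = -sum log(clause_sat)."""
    grad = np.zeros_like(x)
    loss = 0.0
    for clause in CNF:
        lits = [x[v] if pos else 1.0 - x[v] for v, pos in clause]
        one_minus = [1.0 - l for l in lits]
        sat = 1.0 - np.prod(one_minus)
        loss -= np.log(sat + eps)
        for j, (v, pos) in enumerate(clause):
            partial = np.prod([one_minus[k] for k in range(len(clause)) if k != j])
            dsat_dx = partial * (1.0 if pos else -1.0)
            grad[v] -= dsat_dx / (sat + eps)
    return loss, grad

def sample(seed, steps=200, lr=0.3):
    """Gradient descent from a random start; rounding yields one candidate assignment."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.2, 0.8, NUM_VARS)
    for _ in range(steps):
        _, g = relaxed_loss_and_grad(x)
        x = np.clip(x - lr * g, 0.0, 1.0)
    return tuple(int(round(v)) for v in x)

# Different random starts play the role of parallel GPU threads exploring diverse candidates.
print({sample(s) for s in range(8)})
```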
16:55 CEST TS28.6 SMT-BASED REPAIRING REAL-TIME TASK SPECIFICATIONS
Speaker:
Anand Yeolekar, TCS Research, IN
Authors:
Anand Yeolekar1, Ravindra Metta1 and Samarjit Chakraborty2
1TCS, IN; 2UNC Chapel Hill, US
Abstract
When addressing timing issues in real-time systems, approaches for systematic timing debugging and repair have been missing due to (i) Lack of available feedback: most timing analysis techniques, being closed-form analytical techniques, are unable to provide root cause information when a timing property is violated, which is critical for identifying an appropriate repair, and (ii) Pessimism in the analysis: existing schedulability analysis techniques tend to make worst case assumptions in the presence of non-determinism introduced by real-world factors such as release jitter, or sporadic tasks. To address this gap, we propose an SMT encoding of task runs for exact debugging of timing violations, and a procedure to iteratively repair a given task specification. We demonstrate the utility of this procedure by repairing example task sets scheduled under global non-preemptive earliest-deadline-first scheduling, a common choice for many safety-critical systems.
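As a much-simplified picture of encoding task runs in SMT, the sketch below (using the z3-solver Python package) constrains non-preemptive, single-core job start times and asks the solver to minimize the deadline relaxation needed to make an infeasible job set schedulable. The job set, the per-job slack variables, and the single-core restriction are illustrative assumptions, not the paper's global-EDF encoding or repair procedure.

```python
from z3 import Optimize, Real, Or, sat

jobs = [  # (name, release, wcet, deadline) -- illustrative numbers only
    ("t1", 0.0, 3.0, 5.0),
    ("t2", 1.0, 2.0, 6.0),
    ("t3", 2.0, 3.0, 7.0),
]

opt = Optimize()
start = {n: Real(f"s_{n}") for n, *_ in jobs}
slack = {n: Real(f"d_{n}") for n, *_ in jobs}   # per-job deadline relaxation (the "repair")

for name, rel, wcet, dl in jobs:
    opt.add(start[name] >= rel)                 # a job cannot start before its release
    opt.add(slack[name] >= 0)
    opt.add(start[name] + wcet <= dl + slack[name])  # finish by the (possibly relaxed) deadline

# Non-preemptive mutual exclusion on one core: jobs must not overlap in time.
for i in range(len(jobs)):
    for j in range(i + 1, len(jobs)):
        a, _, ca, _ = jobs[i]
        b, _, cb, _ = jobs[j]
        opt.add(Or(start[a] + ca <= start[b], start[b] + cb <= start[a]))

# Minimize the total deadline relaxation needed to make the job set feasible.
opt.minimize(sum(slack.values()))
if opt.check() == sat:
    m = opt.model()
    for n in start:
        print(n, "start =", m[start[n]], "extra deadline =", m[slack[n]])
```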
17:00 CEST TS28.7 HACHIFI: A LIGHTWEIGHT SOC ARCHITECTURE-INDEPENDENT FAULT-INJECTION FRAMEWORK FOR SEU IMPACT EVALUATION
Speaker:
Masanori Hashimoto, Kyoto University, JP
Authors:
Quan Cheng1, Wang Liao2, Ruilin Zhang1, Hao Yu3, Longyang Lin3 and Masanori Hashimoto1
1Kyoto University, JP; 2Kochi University of Technology, JP; 3Southern University of Science and Technology, CN
Abstract
Single-Event Upsets (SEUs), triggered by energetic particles, manifest as unexpected bit-flips in memory cells or registers, potentially causing significant anomalies in electronic devices. Driven by the needs of safety-critical applications, it is crucial to evaluate the reliability of these electronic devices before they are deployed. However, traditional reliability analysis techniques, such as irradiation experiments, are costly, while fault injection (FI) simulations often fail to provide full coverage and have limited effectiveness and accuracy. To address these issues, we introduce HachiFI, a lightweight, architecture-independent framework that automates fault injection with 100% coverage via memory and scan-chain accesses and simulates the behavior of SEUs based on specific cross-sections. HachiFI supports configurable fault injection patterns for both system-level and module-level reliability analysis. Using HachiFI, we demonstrate a low hardware overhead (<2%) and a high match (R^2=0.984) between FI and irradiation experiments, verified on a 22nm edge-AI chip.
17:05 CEST TS28.8 ACCELERATING CELL-AWARE MODEL GENERATION FOR SEQUENTIAL CELLS USING GRAPH THEORY
Speaker:
Gianmarco Mongelli, LIRMM, FR
Authors:
Gianmarco Mongelli1, Eric Faehn2, Dylan Robins2, Patrick Girard3 and Arnaud Virazel3
1LIRMM and STMicroelectronics Crolles, FR; 2STMicroelectronics, FR; 3LIRMM, FR
Abstract
The Cell-Aware (CA) methodology has become essential to detect and diagnose manufacturing intra-cell defects in modern semiconductor technologies. It characterizes standard cells by creating a defect-detection matrix, which serves as a reference that maps stimuli to the specific defects they can detect. Its limitation is that the CA approach needs a number of time-consuming analog simulations to create the matrix. In [1], a graph-based methodology called Transistor Undetectable Defect eLiminator (TrUnDeL), able to reduce the number of simulations to perform, was presented. TrUnDeL can identify undetectable stimulus/defect pairs that are then excluded from the analog simulations. However, its use is limited to combinational cells and does not offer any guidance on handling sequential cells, which are usually the most complex cells. In this paper, we present a new version of TrUnDeL that supports the analysis of sequential cells. Experiments conducted on sequential cells from two industrial standard-cell libraries demonstrate that the CA generation time is reduced by 30% without compromising accuracy.
17:10 CEST TS28.9 AN EFFICIENT PARALLEL FAULT SIMULATOR FOR FUNCTIONAL PATTERNS ON MULTI-CORE SYSTEMS
Speaker:
Xiaoze Lin, Institute of Computing Technology, Chinese Academy of Sciences, CN
Authors:
Xiaoze Lin1, Liyang Lai2, Huawei Li3, Biwei Xie3 and Xingquan Li4
1State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, CN; 2Shantou University, CN; 3Institute of Computing Technology, Chinese Academy of Sciences, CN; 4Peng Cheng Laboratory, CN
Abstract
Fault simulation targeting functional patterns emerges as an essential mechanism within functional safety, crucial for validating the effectiveness of safety mechanisms. The acceleration of fault simulation for functional patterns is imperative for boosting the efficiency and adaptability of functional safety verification, presenting a significant yet unresolved challenge. In this paper, we propose an efficient fault simulator for functional patterns that combines three techniques: fault filtering, fault grouping, and CPU-based parallelism. The integration of these three techniques, tailored to the characteristics of functional patterns, reduces the runtime of fault simulation from different perspectives. The experimental results show that on a 48-core system, an average 79x speedup can be achieved by our parallel fault simulator against a commercial tool.
17:15 CEST TS28.10 SPATIAL MODELING WITH AUTOMATED MACHINE LEARNING AND GAUSSIAN PROCESS REGRESSION TECHNIQUES FOR IMPUTING WAFER ACCEPTANCE TEST DATA
Speaker:
Ming-Chun Wei, National Cheng Kung University, TW
Authors:
Ming-Chun Wei, Hsun-Ping Hsieh and Chun-Wei Shen, National Cheng Kung University, TW
Abstract
The Wafer Acceptance Test (WAT) is a significant quality control measurement in the semiconductor industry. However, because the WAT process can be time-consuming and expensive, sampling tests are commonly employed during production. This makes root-cause tracing impossible when abnormal products have not been tested. Therefore, in our study, we focus on establishing a reliable method to estimate WAT results for non-tested shots, covering both intra- and inter-wafer prediction. Notably, we are the first to combine the use of Chip Probing data with WAT to improve the predictions. Our proposed method first extracts valuable features from Chip Probing test results by using an Automated Machine Learning technique. We then employ Gaussian Process Regression to capture the spatio-temporal correlation. Finally, we adopt a linear regression model to ensemble the two components, yielding the SMART-WAT model that effectively estimates the wafer acceptance test data. Our method has been tested on a real-world dataset from the semiconductor manufacturing industry. The prediction results of four key WAT parameters indicate that our proposed model outperforms the state-of-the-art methods in both intra- and inter-wafer prediction.
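To illustrate the regression component only, the sketch below fits a Gaussian Process over wafer (x, y) shot coordinates and imputes a parameter at untested shots (scikit-learn; the synthetic data, kernel choice, and the omission of the Chip Probing features and AutoML stage are simplifying assumptions).

    # Illustrative only: impute a WAT parameter at untested shots from tested ones.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    xy_tested = rng.uniform(-1, 1, size=(40, 2))                      # sampled shot positions
    wat = 1.0 + 0.3 * xy_tested[:, 0] - 0.2 * xy_tested[:, 1] ** 2    # synthetic WAT values

    kernel = 1.0 * RBF(length_scale=0.5) + WhiteKernel(noise_level=1e-3)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(xy_tested, wat)

    xy_untested = np.array([[0.2, -0.4], [-0.7, 0.1]])                # shots skipped by sampling
    mean, std = gpr.predict(xy_untested, return_std=True)             # imputed value + uncertainty
    print(mean, std)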
17:20 CEST TS28.11 ON THE IMPACT OF WARPAGE ON BEOL GEOMETRY AND PATH DELAYS IN FAN-OUT WAFER-LEVEL PACKAGING
Speaker:
Dhruv Thapar, Arizona State University, US
Authors:
Dhruv Thapar1, Arjun Chaudhuri1, Christopher Bailey1, Ravi Mahajan2 and Krishnendu Chakrabarty1
1Arizona State University, US; 2Intel Corporation, US
Abstract
Warpage is a major concern in fan-out wafer-level packaging (FOWLP) due to the complex thermal processing steps involved in manufacturing. These steps include curing, electroplating, and deposition, which induce residual stresses through differential thermal expansion and contraction of materials. This effect is further amplified by mismatches in the coefficients of thermal expansion (CTE) between different materials. In particular, high-density interconnects in the back-end of line (BEOL), redistribution layers (RDLs), and through-mold vias (TMVs) are susceptible to warpage-induced stress, strain, and deformation. This work conducts structural simulations to analyze warpage in the BEOL stack induced by FOWLP. Our results indicate that the impact of warpage is non-uniform across the entire BEOL geometry of a die; hence, it affects different metal layers, and different coordinates within one metal layer, to different degrees. We leverage this warpage analysis to calculate parasitics and evaluate the resulting changes in path delays.
17:21 CEST TS28.12 MODELING AND ANALYSIS TECHNIQUE FOR THE FORMAL VERIFICATION OF SYSTEM-ON-CHIP ADDRESS MAPS
Speaker:
Niels Mook, NXP, NL
Authors:
Niels Mook1, Erwin de Kock1, Bas Arts1, Soham Chakraborty2 and Arie van Deursen2
1NXP Semiconductors, NL; 2TU Delft, NL
Abstract
This paper proposes a modeling and analysis technique to verify SoC address maps. The approach involves (i) modeling the specification and implementation address map using a unified graph model, and (ii) analysis of equivalence in terms of address maps between two such models. Using a state-of-the-art mid-size SoC design, we demonstrate the proposed solution is able to analyze and verify address maps of complex SoC designs and to identify the causes of discrepancies.
17:22 CEST TS28.13 FREDDY: MODULAR AND EFFICIENT FRAMEWORK TO ENGINEER DECISION DIAGRAMS YOURSELF
Speaker:
Rune Krauss, DFKI, DE
Authors:
Rune Krauss1, Jan Zielasko1 and Rolf Drechsler2
1DFKI, DE; 2University of Bremen | DFKI, DE
Abstract
The hardware complexity in electronic devices used by today's society has increased significantly in recent decades due to technological progress. In order to cope with this complexity, data structures and algorithms in electronic design automation must be continuously improved. Decision Diagrams (DDs) are an important data structure in the design and analysis of circuits because they allow efficient algorithms for their manipulation. The practical relevance of DDs leads to an ongoing quest for appropriate software solutions that enable working with different DD types. Unfortunately, existing DD software libraries focus either on efficiency or on usability. The consequences are either disproportionately high effort for extensions or a considerable loss of performance. To tackle these issues, a modular and efficient Framework to Engineer Decision Diagrams Yourself (FrEDDY) is proposed in this paper. Various experiments demonstrate that no compromise with regard to performance has to be made when using FrEDDY. It is on par with or clearly more efficient than established DD libraries.
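For readers unfamiliar with DD internals, the minimal sketch below shows the canonicity mechanism that most DD libraries are built around, a unique table that shares structurally identical nodes; it illustrates the data structure only and is not FrEDDY's implementation.

    # Hash-consing of DD nodes: mk() returns an existing node for (var, low, high)
    # instead of duplicating it, keeping the diagram reduced and shared.
    class DDManager:
        FALSE, TRUE = 0, 1                      # terminal node ids

        def __init__(self):
            self.nodes = [None, None]           # placeholders for the two terminals
            self.unique = {}                    # (var, low, high) -> node id

        def mk(self, var, low, high):
            if low == high:                     # reduction rule: redundant test
                return low
            key = (var, low, high)
            if key not in self.unique:
                self.nodes.append(key)
                self.unique[key] = len(self.nodes) - 1
            return self.unique[key]

    m = DDManager()
    x1 = m.mk(1, m.FALSE, m.TRUE)               # the variable x1 as a BDD node
    same = m.mk(1, m.FALSE, m.TRUE)             # shared: no new node is created
    print(x1 == same, len(m.nodes))             # True 3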

TS29 Approximate Computing Solutions

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS29.1 EFFICIENT APPROXIMATE LOGIC SYNTHESIS WITH DUAL-PHASE ITERATIVE FRAMEWORK
Speaker:
Ruicheng Dai, Shanghai Jiao Tong University, CN
Authors:
Ruicheng Dai1, Xuan Wang1, Wenhui Liang1, Xiaolong Shen2, Menghui Xu2, Leibin Ni2, Gezi Li2 and Weikang Qian1
1Shanghai Jiao Tong University, CN; 2Huawei Technologies Co., Ltd., CN
Abstract
Approximate computing is an emerging paradigm to improve the energy efficiency for error-tolerant applications. Many iterative approximate logic synthesis (ALS) methods were proposed to automatically design approximate circuits. However, as the sizes of circuits grow, the runtime of ALS grows rapidly. Thus, a crucial challenge is to ensure circuit quality while improving the efficiency of ALS. This work proposes a dual-phase iterative framework to accelerate the iterative ALS flows. In the first phase, a comprehensive circuit analysis is performed to gather the necessary information, including the error information. In the second phase, minimal incremental computation is employed based on the information from the first phase. The experimental results show that the proposed method achieves an acceleration by up to 21.8× without loss of circuit quality compared to the state-of-the-art methods.
16:35 CEST TS29.2 EFFICIENT APPROXIMATE NEAREST NEIGHBOR SEARCH VIA DATA-ADAPTIVE PARAMETER ADJUSTMENT IN HIERARCHICAL NAVIGABLE SMALL GRAPHS
Speaker:
Huijun Jin, Yonsei University, KR
Authors:
Huijun Jin, Jieun Lee, Shengmin Piao, Sangmin Seo, Sein Kwon and Sanghyun Park, Yonsei University, KR
Abstract
Hierarchical Navigable Small World (HNSW) graphs are a state-of-the-art solution for approximate nearest neighbor search, widely applied in areas like recommendation systems, computer vision, and natural language processing. However, the effectiveness of the HNSW algorithm is constrained by its reliance on static parameter settings, which do not account for variations in data density and dimensionality across different datasets. This paper introduces Dynamic HNSW, an adaptive method that dynamically adjusts key parameters, such as M (the number of connections per node) and ef (the search depth), based on both the local data density and the dimensionality of the dataset. The proposed approach improves flexibility and efficiency, allowing the graph to adapt to diverse data characteristics. Experimental results across multiple datasets demonstrate that Dynamic HNSW significantly reduces graph build time by up to 33.11% and memory usage by up to 32.44%, while maintaining comparable recall, thereby outperforming the conventional HNSW in both scalability and efficiency.
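A minimal sketch of the data-adaptive idea, assuming local density is estimated from the mean distance to the k nearest neighbours and mapped linearly onto ranges for M and ef (numpy only; the mapping and ranges are illustrative, not the paper's policy):

    import numpy as np

    def adaptive_params(data, k=10, m_range=(8, 32), ef_range=(50, 200)):
        d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        knn_dist = np.sort(d, axis=1)[:, :k].mean(axis=1)        # local density proxy
        dens = (knn_dist - knn_dist.min()) / (np.ptp(knn_dist) + 1e-12)
        M = np.round(m_range[0] + dens * (m_range[1] - m_range[0])).astype(int)
        ef = np.round(ef_range[0] + dens * (ef_range[1] - ef_range[0])).astype(int)
        return M, ef        # sparser regions receive more connections / deeper search

    M, ef = adaptive_params(np.random.default_rng(0).normal(size=(100, 16)))
    print(M[:5], ef[:5])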
16:40 CEST TS29.3 HAAN: A HOLISTIC APPROACH FOR ACCELERATING LAYER NORMALIZATION IN LARGE LANGUAGE MODELS
Speaker:
Sai Qian Zhang, New York University, US
Authors:
Tianfan Peng1, Tianhua Xia2, Jiajun Qin3 and Sai Qian Zhang4
1Tongji University, CN; 2Independent Researcher, US; 3Zhejiang University, CN; 4New York University, US
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP) tasks by achieving state-of-the-art performance across a range of benchmarks. Central to the success of these models is the integration of sophisticated architectural components aimed at improving training stability, convergence speed, and generalization capabilities. Among these components, normalization operations, such as layer normalization (LayerNorm), emerge as a pivotal technique, offering substantial benefits to the overall model performance. However, previous studies have indicated that normalization operations can substantially elevate processing latency and energy usage. In this work, we adopt the principles of algorithm and hardware co-design, introducing a holistic normalization accelerating method named HAAN. The evaluation results demonstrate that HAAN can achieve significantly better hardware performance compared to state-of-the-art solutions.
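For reference, the operation being accelerated, per-token normalization followed by a learned affine transform, can be written in a few lines of numpy; this is the textbook definition of LayerNorm, not HAAN's hardware mapping.

    import numpy as np

    def layer_norm(x, gamma, beta, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)     # mean over the hidden dimension
        var = x.var(axis=-1, keepdims=True)     # variance over the hidden dimension
        return gamma * (x - mu) / np.sqrt(var + eps) + beta

    x = np.random.default_rng(0).normal(size=(4, 768))           # 4 tokens, hidden size 768
    y = layer_norm(x, gamma=np.ones(768), beta=np.zeros(768))
    print(y.mean(axis=-1), y.std(axis=-1))                       # roughly 0 and 1 per token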
16:45 CEST TS29.4 MCTA: A MULTI-STAGE CO-OPTIMIZED TRANSFORMER ACCELERATOR WITH ENERGY-EFFICIENT DYNAMIC SPARSE OPTIMIZATION
Speaker:
Heng Liu, Harbin Institute of Technology, CN
Authors:
Heng Liu, Ming Han, Jin Wu, Ye Wang and Jian Dong, Harbin Institute of Technology, CN
Abstract
As Transformer-based models continue to enhance service quality across various domains, their intensive computational requirements are exacerbating the AI energy crisis. Traditional energy-efficient Transformer architectures primarily focus on optimizing the Attention stage due to its high algorithmic complexity (O(n^2)). However, linear layers can also be significant energy consumers, sometimes accounting for over 70% of total energy usage. Although existing approaches such as sparsity have improved the Attention stage, the optimization space within such linear layers is not fully exploited. In this paper, we introduce the multi-stage co-optimized Transformer accelerator (MCTA) for optimizing energy efficiency. Our approach independently enhances the Query-Key-Value generation, Attention, and Feed-forward Neural Network stages. It employs two novel techniques: Low-overhead Mask Generation (LMG) for dynamically identifying unimportant calculations with minimal energy costs, and Cascaded Mask Derivation (CMD) for streamlining the mask generation process through parallel processing. Experimental results show that MCTA achieves an average energy reduction of 1.48× with only a 1% accuracy loss compared to state-of-the-art accelerators. This work demonstrates the potential for significant energy savings in Transformer models without the need for retraining, paving the way for more sustainable AI applications.
16:50 CEST TS29.5 CIRCUITS IN A BOX: COMPUTING HIGH-DIMENSIONAL PERFORMANCE SPACES FOR ANALOG INTEGRATED CIRCUITS
Speaker:
Juergen Kampe, Ernst-Abbe-Hochschule Jena, DE
Authors:
Benedikt Ohse, Jürgen Kampe and Christopher Schneider, Ernst-Abbe-Hochschule Jena, DE
Abstract
Performance spaces contain information about all combinations of attainable performance parameters of analog integrated circuits. Their exploration allows designers to evaluate given circuits without considering implementation details, making them a valuable tool to support the design process. The computation of performance spaces---even for a small number of considered parameters---is time-consuming because it requires solving multi-objective, non-convex optimization problems that involve costly circuit simulations. We present a numerical method for efficiently approximating high-dimensional performance spaces, which is based on the box-coverage method known from Pareto optimization. The resulting implementation not only outperforms state-of-the-art solvers based on the well-known Normal-Boundary Intersection method in terms of computational complexity, but also offers several advantages, such as a practical stopping criterion and the possibility of warm starting. Furthermore, we present an interactive visualization technique to explore performance spaces of any dimension, which can help system designers to make reliable topology decisions even without detailed technical knowledge of the underlying circuits. Numerical experiments that confirm the efficiency of our approach are performed by computing seven-dimensional performance spaces for an analog low-dropout regulator as used in the radio-frequency identification domain.
16:55 CEST TS29.6 GRADIENT APPROXIMATION OF APPROXIMATE MULTIPLIERS FOR HIGH-ACCURACY DEEP NEURAL NETWORK RETRAINING
Speaker:
Chang Meng, EPFL, CH
Authors:
Chang Meng1, Wayne Burleson2, Weikang Qian3 and Giovanni De Micheli1
1EPFL, CH; 2U Massachusetts Amherst, US; 3Shanghai Jiao Tong University, CN
Abstract
Approximate multipliers (AppMults) are widely employed in deep neural network (DNN) accelerators to reduce the area, delay, and power consumption. However, the inaccuracies of AppMults degrade DNN accuracy, necessitating a retraining process to recover accuracy. A critical step in retraining is computing the gradient of the AppMult, i.e., the partial derivative of the approximate product with respect to each input operand. Conventional methods approximate this gradient using that of the accurate multiplier (AccMult), often leading to suboptimal retraining results, especially for AppMults with relatively large errors. To address this issue, we propose a difference-based gradient approximation of AppMults to improve retraining accuracy. Experimental results show that compared to the state-of-the-art methods, our method improves the DNN accuracy after retraining by 4.10% and 2.93% on average for the VGG and ResNet models, respectively. Moreover, after retraining a ResNet18 model using the 7-bit AppMult, the final DNN accuracy does not degrade compared to the quantized model using the 7-bit AccMult, while the power consumption is reduced by 51%.
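A minimal sketch of a difference-based gradient, assuming a simple truncation-based approximate multiplier as a stand-in (the multipliers, step size, and exact gradient formulation in the paper may differ):

    import numpy as np

    def app_mult(a, b, drop_bits=3):
        scale = 1 << drop_bits
        return (np.floor(a / scale) * scale) * b       # truncate low bits of one operand

    def diff_grad_a(a, b, delta=1.0):
        # probe the approximate product instead of reusing d(ab)/da = b
        return (app_mult(a + delta, b) - app_mult(a, b)) / delta

    a, b = 37.0, 5.0
    print("accurate-multiplier gradient:", b)                    # what conventional retraining uses
    print("difference-based gradient   :", diff_grad_a(a, b))    # reflects the AppMult itself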
17:00 CEST TS29.7 SEGMENT-WISE ACCUMULATION: LOW-ERROR LOGARITHMIC DOMAIN COMPUTING FOR EFFICIENT LARGE LANGUAGE MODEL INFERENCE
Speaker:
Xinkuang Geng, Shanghai Jiao Tong University, CN
Authors:
Xinkuang Geng, Yunjie Lu, Hui Wang and Honglan Jiang, Shanghai Jiao Tong University, CN
Abstract
Logarithmic domain computing (LDC) has great potential for reducing quantization errors and computational complexity in Large Language Models (LLMs). While logarithmic multiplication can be efficiently implemented using fixed-point addition, the primary challenge in multiply-accumulate (MAC) operations is balancing the precision of logarithmic adders with their hardware overhead. Through a detailed analysis of the errors inherent in LDC-based LLMs, we propose segment-wise accumulation (SWA) to mitigate these errors. In addition, a processing element (PE) is introduced to enable SWA in the systolic array architecture. Compared with the accumulation scheme devised for enhancing floating-point computing, the proposed SWA facilitates integration into existing accelerator architectures, resulting in lower hardware overhead. The experimental results show that SWA allows LDC under low-precision configurations to achieve remarkable accuracy in LLMs, demonstrating higher hardware efficiency than merely increasing the precision of individual computations. Our method, while maintaining a lower hardware overhead than traditional LDC, achieves more than 13.9% improvement in average accuracy across multiple zero-shot benchmarks on Llama-2-7B. Furthermore, compared to integer domain computing, a logarithmic processing element array based on the proposed SWA yields reductions of 24.6% in area and 42.3% in power, while achieving higher accuracy.
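The general principle behind segment-wise accumulation can be sketched with a deliberately low-precision accumulator standing in for a cheap logarithmic adder (numpy; float16 and the segment size are illustrative assumptions, not the paper's LNS arithmetic): accumulating in independent segments and combining them at higher precision keeps the rounding error from growing with the reduction length.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=4096).astype(np.float32)
    reference = x.astype(np.float64).sum()

    def naive(vals):
        acc = np.float16(0.0)
        for v in vals:
            acc = np.float16(acc + np.float16(v))      # every addition is low precision
        return float(acc)

    def segment_wise(vals, seg=64):
        partials = [naive(vals[i:i + seg]) for i in range(0, len(vals), seg)]
        return float(np.sum(np.float32(partials)))     # combine segments more precisely

    print(abs(naive(x) - reference), abs(segment_wise(x) - reference))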
17:05 CEST TS29.8 LOOKUP TABLE REFACTORING: TOWARDS EFFICIENT LOGARITHMIC NUMBER SYSTEM ADDITION FOR LARGE LANGUAGE MODELS
Speaker:
Xinkuang Geng, Shanghai Jiao Tong University, CN
Authors:
Xinkuang Geng1, Siting Liu2, Hui Wang1, Jie Han3 and Honglan Jiang1
1Shanghai Jiao Tong University, CN; 2ShanghaiTech University, CN; 3University of Alberta, CA
Abstract
Compared to integer quantization, logarithmic quantization aligns more effectively with the long-tailed distribution of data in large language models (LLMs), resulting in lower quantization errors. Moreover, the logarithmic number system (LNS) employs a fixed-point adder to perform multiplication, indicating a potential reduction in computational complexity for LLM accelerators that require extensive multiply-accumulate (MAC) operations. However, a key bottleneck is that LNS addition requires complex nonlinear functions, which are typically approximated using lookup tables (LUTs). This study aims to reduce the hardware resources needed for LUTs in LNS addition while maintaining high precision. Specifically, we investigate the specific nature of addition operations within LLMs; the relationship between the hardware parameters of the LUT and the computing errors is then mathematically derived. Based on these insights, we propose LUT refactoring to optimize the LUT for enhanced efficiency in LNS addition. With 10.93% and 19.78% reductions in area-delay product (ADP) and power-delay product (PDP), respectively, LUT refactoring results in an accuracy improvement of up to 33.5% in LLM benchmarks compared to the naive design. When compared to integer quantization, our method achieves higher accuracy while reducing area by 18.27% and power by 42.61%.
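To see why a lookup table appears at all, recall the LNS addition identity: for x = log2(a) and y = log2(b), log2(a + b) = max(x, y) + log2(1 + 2^-(|x - y|)). The sketch below tabulates the nonlinear correction term (numpy; table size and indexing are arbitrary choices, not the refactored LUT of the paper):

    import numpy as np

    D_MAX, ENTRIES = 8.0, 64
    lut = np.log2(1.0 + 2.0 ** -np.linspace(0.0, D_MAX, ENTRIES))   # precomputed corrections

    def lns_add(x, y):
        d = abs(x - y)
        if d >= D_MAX:                      # operands far apart: the smaller one vanishes
            return max(x, y)
        idx = int(d / D_MAX * (ENTRIES - 1))
        return max(x, y) + lut[idx]         # table lookup replaces exact log/exp hardware

    a, b = 6.0, 3.5                         # log2 of two positive values
    print(lns_add(a, b), np.log2(2.0 ** a + 2.0 ** b))   # error is set by the LUT resolution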
17:10 CEST TS29.9 EVASION: EFFICIENT KV CACHE COMPRESSION VIA PRODUCT QUANTIZATION
Speaker:
Zongwu Wang, Shanghai Jiao Tong University, CN
Authors:
Zongwu Wang1, Fangxin Liu1, Peng Xu1, Qingxiao Sun2, Junping Zhao3 and Li Jiang1
1Shanghai Jiao Tong University, CN; 2China University of Petroleum, Beijing, CN; 3Ant Group, CN
Abstract
Large language models (LLMs) benefit from longer context lengths but suffer from quadratic complexity in the attention mechanism. KV caching alleviates this issue by storing pre-computed data, but its memory requirements increase linearly with context length, thereby hindering the intelligent development of LLMs. Traditional weight quantization schemes perform poorly in KV quantization for two reasons: (1) KV requires dynamic quantization and de-quantization, which can lead to significant performance degradation; (2) outliers are widely present in KV, which poses a challenge to low-bitwidth uniform quantization. This work proposes a novel approach called EVASION to achieve low-bitwidth quantization through product quantization. We thoroughly analyze the distribution of the KV cache and demonstrate the limitations of existing quantization schemes. Then a non-uniform quantization algorithm based on product quantization is introduced, which offers efficient compression while maintaining accuracy. Finally, we design a high-performance GPU inference framework for EVASION, utilizing sparse computation and asynchronous quantization for further acceleration. Comprehensive evaluation results demonstrate that EVASION achieves 4-bit quantization with trivial perplexity and accuracy loss, and it also achieves a 1.8x end-to-end inference speedup.
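A minimal sketch of product quantization applied to cached key vectors, assuming per-sub-space k-means codebooks and one-byte codes (scikit-learn; the sizes are illustrative, and the paper's outlier handling, sparse computation, and GPU kernels are omitted):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    keys = rng.normal(size=(1024, 64)).astype(np.float32)    # cached keys: tokens x dim
    M, K = 8, 256                                            # sub-spaces, codewords per sub-space
    sub = keys.reshape(1024, M, 64 // M)

    codebooks, codes = [], []
    for m in range(M):
        km = KMeans(n_clusters=K, n_init=1, random_state=0).fit(sub[:, m, :])
        codebooks.append(km.cluster_centers_)
        codes.append(km.labels_.astype(np.uint8))            # one byte per sub-vector

    codes = np.stack(codes, axis=1)                          # the compressed cache
    recon = np.stack([codebooks[m][codes[:, m]] for m in range(M)], axis=1).reshape(1024, 64)
    print("compression ratio:", keys.nbytes / codes.nbytes)  # 32x versus float32 here
    print("mean squared error:", np.mean((keys - recon) ** 2))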
17:11 CEST TS29.10 SOFTEX: A LOW POWER AND FLEXIBLE SOFTMAX ACCELERATOR WITH FAST APPROXIMATE EXPONENTIATION
Speaker:
Andrea Belano, University of Bologna, IT
Authors:
Andrea Belano1, Yvan Tortorella1, Angelo Garofalo2, Davide Rossi1, Luca Benini3 and Francesco Conti1
1Università di Bologna, IT; 2University of Bologna, ETH Zurich, IT; 3ETH Zurich, CH | Università di Bologna, IT
Abstract
Transformer-based models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. Despite Transformers being computationally dominated by matrix multiplications (MatMul), a non-negligible portion of their runtime is also spent on executing the softmax operator. The softmax is a non-linear and non-pointwise operator that can become a performance bottleneck especially if dedicated hardware is used to decrease the runtime of MatMul operators. We introduce SoftEx, a parametric accelerator for the softmax function of BF16 vectors. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (121× speedup over glibc's implementation) with accuracy (mean relative error of 0.14%). We integrate our design in a heterogeneous tightly-coupled cluster containing 256KiB of shared SRAM and 8 general-purpose RISC-V cores as well as a 24×8 systolic array MatMul accelerator. In 12nm technology, SoftEx occupies 0.033 mm², only 2.75% of the cluster, and achieves an operating frequency of 1.12 GHz. Computing the attention probabilities with SoftEx requires up to 10.8× less time and 26.8× less energy compared to a highly optimized software implementation running on the 8 cores, boosting the overall throughput on MobileBERT's attention layer by up to 2.17×, achieving a performance of 324 GOPS at 0.80V or 1.30 TOPS/W at 0.55V at full BF16 accuracy.
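The flavour of fast approximate exponentiation can be sketched by rewriting e^x as 2^(x*log2(e)), handling the integer part of the exponent exactly and approximating 2^f on [0, 1) with a low-order polynomial (numpy; this mirrors the general idea only and is not SoftEx's algorithm or error profile):

    import numpy as np

    def approx_exp(x):
        t = x * np.log2(np.e)
        n = np.floor(t)                                      # integer part: exact scaling by 2**n
        f = t - n                                            # fractional part in [0, 1)
        p = 1.0 + f * (0.6930 + f * (0.2416 + f * 0.0520))   # cubic fit to 2**f
        return np.ldexp(p, n.astype(np.int32))

    def approx_softmax(v):
        e = approx_exp(v - v.max())                          # max subtraction for stability
        return e / e.sum()

    v = np.array([1.5, -0.3, 2.2, 0.0])
    print(approx_softmax(v))
    print(np.exp(v - v.max()) / np.exp(v - v.max()).sum())   # exact reference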

TS30 System Level Design and Test, Modeling and Verification

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS30.1 ERASER: EFFICIENT RTL FAULT SIMULATION FRAMEWORK WITH TRIMMED EXECUTION REDUNDANCY
Speaker:
Jiaping Tang, State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN
Authors:
Jiaping Tang1, Jianan Mu1, Silin Liu1, Zizhen Liu1, Feng Gu2, Xinyu Zhang1, Leyan Wang1, Shengwen Liang2, Jing Ye1, Huawei Li1 and Xiaowei Li3
1State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences / University of Chinese Academy of Sciences / CASTEST, CN; 2State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences / University of Chinese Academy of Sciences, CN; 3State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences / University of Chinese Academy of Sciences, CN
Abstract
As intelligent computing devices increasingly integrate into human life, ensuring the functional safety of the corresponding electronic chips becomes more critical. A key metric for functional safety is achieving sufficient fault coverage. To meet this requirement, extensive time-consuming fault simulation of the RTL code is necessary during the chip design phase. The main overhead in RTL fault simulation comes from simulating behavioral nodes (always blocks). Due to the limited fault propagation capacity, fault simulation results often match the good simulation results for many behavioral nodes. A key strategy for accelerating RTL fault simulation is the identification and elimination of redundant simulations. Existing methods detect redundant executions by examining whether the fault inputs to each RTL node are consistent with the good inputs. However, we observe that this input comparison mechanism overlooks a significant amount of implicit redundant execution: although the fault inputs differ from the good inputs, the node's execution results remain unchanged. Our experiments reveal that this overlooked redundant execution constitutes nearly half of the total execution overhead of behavioral nodes, becoming a significant bottleneck in current RTL fault simulation. The underlying reason for this overlooked redundancy is that, in these cases, the true execution paths within the behavioral nodes are not affected by the changes in input values. In this work, we propose a behavior-level redundancy detection algorithm that focuses on the true execution paths. Building on the elimination of redundant executions, we further develop an efficient RTL fault simulation framework, Eraser. Experimental results show that compared to commercial tools, under the same fault coverage, our framework achieves a 3.9× improvement in simulation performance on average.
16:35 CEST TS30.2 PESEC -- A SIMPLE POWER-EFFICIENT SINGLE ERROR CORRECTING CODING SCHEME FOR RRAM
Speaker:
Shlomo Engelberg, Jerusalem College of Technology, IL
Authors:
Shlomo Engelberg1 and Osnat Keren2
1Jerusalem College of Technology, IL; 2Bar-Ilan University, IL
Abstract
The power consumed when writing to Resistive Random Access Memory (RRAM) is significantly greater than that consumed by many charge-based memories such as SRAM, DRAM and NAND-Flash memories. As a result, when used in applications where instantaneous power consumption is constrained, the number of bits that can be set or reset must not exceed a certain threshold. In this paper, we present a power-efficient, single error correcting (PESEC) code for memory macros, which, when combined with bus encoding, ensures low-power operation and reliable data storage. This systematic, multiple-representation based single-error correcting code provides a relatively high rate, with a marginal increase in implementation cost relative to that of a standard Hamming code, and it can be used with any bus encoder.
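For context, the standard Hamming single-error-correcting baseline that the paper's rate and cost are compared against can be written compactly; the multiple-representation PESEC construction itself is not reproduced here (Python, systematic Hamming(7,4), illustrative).

    import numpy as np

    G = np.array([[1, 0, 0, 0, 1, 1, 0],      # generator: 4 data bits -> 7-bit codeword
                  [0, 1, 0, 0, 1, 0, 1],
                  [0, 0, 1, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 1, 1]])
    H = np.array([[1, 1, 0, 1, 1, 0, 0],      # parity-check matrix
                  [1, 0, 1, 1, 0, 1, 0],
                  [0, 1, 1, 1, 0, 0, 1]])

    def encode(data):
        return data @ G % 2

    def correct(word):
        syndrome = word @ H.T % 2
        if syndrome.any():                    # non-zero syndrome: locate the flipped bit
            err = next(i for i in range(7) if np.array_equal(H[:, i], syndrome))
            word = word.copy()
            word[err] ^= 1
        return word

    cw = encode(np.array([1, 0, 1, 1]))
    noisy = cw.copy(); noisy[2] ^= 1          # inject a single-bit error
    print(np.array_equal(correct(noisy), cw)) # True: error corrected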
16:40 CEST TS30.3 FROM GATES TO SDCS: UNDERSTANDING FAULT PROPAGATION THROUGH THE COMPUTE STACK
Speaker:
Odysseas Chatzopoulos, University of Athens, GR
Authors:
Odysseas Chatzopoulos1, George Papadimitriou1, Dimitris Gizopoulos1, Harish Dixit2 and Sriram Sankar2
1University of Athens, GR; 2Meta Platforms Inc., US
Abstract
Silent Data Corruption (SDC) is the most severe effect of a silicon defect in a CPU or other computing chip. The arithmetic units of a CPU are, usually, unprotected and are, thus, the ones that most likely produce SDCs (as well as visible malfunctions of programs such as crashes). In this work, we shed light on the traversal of silicon defects from their point of origin deep inside arithmetic units of complex CPUs towards the program result. We employ microarchitecture-level fault injection enhanced with gate-level designs of the arithmetic units of interest. The hybrid setup combines (i) the accuracy of the hardware and fault modeling and (ii) the speed of program simulation to run long programs to end (thus observing SDC incidents); the analysis that this combination delivers is impossible at other abstraction layers which are either hardware-agnostic (software level) or extremely slow (gate-level). We quantify the effects of faults in two stages and with multiple metrics: (a) how faults propagate to the outputs of the arithmetic units when individual instructions are executed, and (b) how faults eventually affect the outcome of the program generating SDCs, crashes, or being masked. Our fine-grain findings can be utilized for informed fault detection and tolerance strategies at the hardware or the software levels.
16:45 CEST TS30.4 RAPID FAULT INJECTION SIMULATION BY HASH-BASED DIFFERENTIAL FAULT EFFECT EQUIVALENCE CHECKS
Speaker:
Johannes Geier, TU Munich, DE
Authors:
Johannes Geier1, Leonidas Kontopoulos1, Daniel Mueller-Gritschneder2 and Ulf Schlichtmann1
1TU Munich, DE; 2TU Wien, AT
Abstract
Assessing a computational system's resilience to hardware faults is essential for safety and security-related systems. Fault Injection (FI) simulation is a valuable tool that can increase confidence in computational systems and guide hardware and software design decisions in the early stages of development. However, simulating hardware at low levels of abstraction, such as Register Transfer Level (RTL), is costly, and minimizing the effort required for large-scale FI campaigns is a significant objective. This work introduces Hash-based Differential Fault Effect Equivalence Checks to automatically terminate experiments early based on predicting their outcome. We achieve this by matching observed fault effects to ones already encountered in previous experiments. We generate these hashes from differentials computed by repurposing existing fast boot checkpoints from a state-of-the-art acceleration method. By integrating these approaches in an automated manner, we can accelerate a large-scale FI simulation of a CPU at RTL. We reduce the average simulation time by a factor of up to 25, compared to a factor of around 2 to 5 for state-of-the-art techniques. While maintaining 100% accuracy, we can recover the faulty state through the stored differentials.
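The early-termination idea can be sketched as follows, assuming checkpoints are available as dictionaries of register values; the hashing of state differentials below is illustrative and does not reflect the paper's checkpoint format or workflow.

    import hashlib

    def diff_digest(golden, faulty):
        diff = sorted((k, v) for k, v in faulty.items() if golden.get(k) != v)
        return hashlib.sha256(repr(diff).encode()).hexdigest()

    golden = {"pc": 0x100, "r1": 7, "r2": 3}
    seen = {}                                        # digest -> recorded outcome

    for fault_id, faulty in [(0, {"pc": 0x100, "r1": 7, "r2": 11}),
                             (1, {"pc": 0x100, "r1": 7, "r2": 11}),   # same fault effect as 0
                             (2, {"pc": 0x104, "r1": 7, "r2": 3})]:
        h = diff_digest(golden, faulty)
        if h in seen:
            print(f"fault {fault_id}: terminate early, outcome = {seen[h]}")
        else:
            outcome = "SDC"                          # placeholder for running the full simulation
            seen[h] = outcome
            print(f"fault {fault_id}: simulated to completion, outcome = {outcome}")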
16:50 CEST TS30.5 DEAR: DEPENDABLE 3D ARCHITECTURE FOR ROBUST DNN TRAINING
Speaker:
Ashish Reddy Bommana, Arizona State University, US
Authors:
Ashish Reddy Bommana1, Farshad Firouzi2, Chukwufumnanya Ogbogu3, Biresh Kumar Joardar4, Janardhan Rao Doppa3, Partha Pratim Pande3 and Krishnendu Chakrabarty1
1Arizona State University, US; 2ASU, US; 3Washington State University, US; 4University of Houston, US
Abstract
ReRAM-based compute-in-memory (CiM) architectures present an attractive design choice for accelerating deep neural network (DNN) training. However, these architectures are susceptible to stuck-at faults (SAFs) in ReRAM cells, which arise from manufacturing defects and cell wearout over time, particularly due to the continuous weight updates during DNN training. These faults significantly degrade accuracy and compromise dependability. To address this issue, we propose DEAR: a dependable 3D architecture for robust DNN training. DEAR introduces a novel online compensation method that employs a digital compensation unit to correct SAF-induced errors dynamically during both the forward and backward phases of DNN training. Additionally, DEAR leverages an HBM-based 3D memory structure to store fault-related error information efficiently. Experimental results show that DEAR limits inferencing accuracy loss to under 2% even when up to 10% of cells are faulty with uniformly distributed faults, and under 2% for up to 5% faulty cells in clustered distributions. This high fault tolerance is achieved with an area overhead of 11.5% and an energy overhead of less than 6% for VGG networks and less than 12% for ResNet networks.
16:55 CEST TS30.6 IMPROVING SOFTWARE RELIABILITY WITH RUST: IMPLEMENTATION FOR ENHANCED CONTROL FLOW CHECKING METHODS
Speaker:
Jacopo Sini, Politecnico di Torino, IT
Authors:
Jacopo Sini1, Mohammadreza Amel Solouki1, Massimo Violante1 and Giorgio Di Natale2
1Politecnico di Torino, IT; 2TIMA - CNRS, FR
Abstract
The C language, traditionally used in developing safety-critical systems, often faces memory management issues, leading to potential vulnerabilities. Rust emerges as a safer and more secure alternative, aiming to mitigate these risks with its robust memory protection features, making it suitable for producing reliable code in critical environments such as the automotive industry. This study proposes employing Rust code hardened by Control Flow Checking (CFC) in real-time embedded systems, whose software is traditionally developed in assembly and C. The methods have been implemented at the application level, i.e., in the Rust source code, to make them platform-agnostic. A methodology is presented for leveraging Rust's advantages, such as stronger security guarantees and modern language features, to implement these methods more effectively. Highlighting a use case in the automotive sector, our research demonstrates Rust's capacity to enhance system reliability through CFC, especially against Random Hardware Faults. Two CFC algorithms from the literature, YACCA and RACFED, have been implemented in Rust to assess their effectiveness, obtaining 46.5% Diagnostic Coverage for the YACCA method and 50.1% for RACFED. The proposed approach is aligned with functional safety standards, showcasing how Rust can balance safety requirements and cost considerations in industries reliant on software solutions for critical functionalities.
17:00 CEST TS30.7 BRIDGING THE GAP BETWEEN ANOMALY DETECTION AND RUNTIME VERIFICATION: H-CLASSIFIERS
Speaker:
Hagen Heermann, RPTU Kaiserslautern, DE
Authors:
Hagen Heermann and Christoph Grimm, University of Kaiserslautern-Landau, DE
Abstract
Runtime Verification (RV) and Anomaly Detection (AD) are crucial for ensuring the reliability of cyber-physical systems, but existing methods often suffer from high computational costs and lack of explainability. This paper presents a novel approach that integrates formal methods into anomaly detection, transforming complex system models into efficient classification tasks. By combining the strengths of RV and AD, our method significantly improves detection efficiency while providing explainability for failure causes. Our approach offers a promising solution for enhancing the safety and reliability of critical systems.
17:05 CEST TS30.8 CRITICALITY AND REQUIREMENT AWARE HETEROGENEOUS COHERENCE FOR MIXED CRITICALITY SYSTEMS
Speaker:
Mohamed Hassan, McMaster University, CA
Authors:
Safin Bayes and Mohamed Hassan, McMaster University, CA
Abstract
We propose CoHoRT, the first heterogeneous cache-coherent solution for mixed criticality systems (MCS), equipped with several features that target the characteristics and requirements of such systems. CoHoRT is requirement-aware: it provides an optimization engine to optimally configure the architecture based on system requirements. CoHoRT is also criticality-aware: it introduces a low-cost novel architecture that enables cores to heterogeneously run different coherence protocols (time-based and MSI-based protocols). Moreover, it enables a run-time switch between these protocols to provide hardware support for operation-mode switching, which is a common challenge in MCS. Our evaluation shows that CoHoRT outperforms existing solutions in both worst-case memory latency and overall average performance. It also illustrates that CoHoRT is able to meet timing requirements in various MCS setups and showcases CoHoRT's ability to adapt to mode switches.
17:10 CEST TS30.9 PROTECTING CYBER-PHYSICAL SYSTEMS VIA VENDOR-CONSTRAINED SECURITY AUDITING WITH REINFORCEMENT LEARNING
Speaker:
Nan Wang, East China University of Science and Technology, CN
Authors:
Nan Wang1, Kai Li1, Lijun Lu1, Zhiwei Zhao1 and Zhiyuan Ma2
1School of Information Science and Engineering, East China University of Science and Technology, CN; 2Institute of Machine Intelligence, University of Shanghai for Science and Technology, CN
Abstract
Hardware Trojans may cause security issues in cyber-physical systems (CPSs), and recently proposed mutual auditing frameworks have helped build trustworthy CPSs with untrustworthy devices by requiring neighboring devices to come from different vendors. However, this can cause severe multi-vendor integration challenges, such as high cost, difficult maintenance, and an insufficient number of vendors from which to purchase devices. In this work, we improve the mutual auditing framework by maintaining the security of the CPS with fewer vendors. First, a vendor-constrained security auditing framework is introduced to enhance the security of the CPS network with limited vendors, where side-auditing detects hardware Trojan collusion between neighboring nodes and infected-node isolation stops the spread of active HTs. Second, a multi-agent cooperative reinforcement learning-based method is proposed to assign devices to appropriate vendors in the context of security auditing, and it provides solutions that minimize the number of offline nodes due to HT infection. The experimental results show that our proposed method reduces the number of vendors needed by 40.95%, while causing only a 0.39% increase in infected nodes.
17:15 CEST TS30.10 ADAPTIVE BRANCH-AND-BOUND TREE EXPLORATION FOR NEURAL NETWORK VERIFICATION
Speaker:
Kota Fukuda, Kyushu University, JP
Authors:
Kota Fukuda1, Guanqin Zhang2, Zhenya Zhang1, Yulei Sui2 and Jianjun Zhao1
1Kyushu University, JP; 2University of New South Wales, AU
Abstract
Formal verification is a rigorous approach that can provably ensure the quality of neural networks, and to date, Branch and Bound (BaB) is the state of the art, performing verification by splitting the problem as needed and applying off-the-shelf verifiers to sub-problems for improved performance. However, existing BaB may not be efficient, due to its naive way of exploring the space of sub-problems, which ignores the importance of different sub-problems. To bridge this gap, we first introduce a notion of importance that reflects how likely a counterexample can be found within a sub-problem, and then we devise a novel verification approach, called ABONN, that explores the sub-problem space of BaB adaptively, in a Monte-Carlo tree search (MCTS) style. The exploration is guided by the importance of the different sub-problems, so it favors the sub-problems that are more likely to contain counterexamples. As soon as it finds a counterexample, it can immediately terminate; even if it finds none, it can still verify the problem after visiting all the sub-problems. We evaluate ABONN with 552 verification problems from commonly used datasets and neural network models, and compare it with state-of-the-art verifiers as baseline approaches. The experimental evaluation shows that ABONN demonstrates speedups of up to 15.2x on MNIST and 24.7x on CIFAR-10. We further study the influence of hyperparameters on the performance of ABONN, and the effectiveness of our adaptive tree exploration.
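A toy sketch of importance-guided sub-problem selection in the MCTS/UCB spirit, where each sub-problem's score stands in for its estimated likelihood of containing a counterexample (the scores here are synthetic; ABONN's actual importance measure and tree policy are defined in the paper):

    import math, random

    random.seed(0)
    subproblems = [{"id": i, "visits": 0, "score": random.random()} for i in range(6)]
    total_visits = 0

    def ucb(sp, c=1.4):
        if sp["visits"] == 0:
            return float("inf")                      # visit every sub-problem at least once
        return sp["score"] + c * math.sqrt(math.log(total_visits) / sp["visits"])

    for step in range(10):
        pick = max(subproblems, key=ucb)             # adaptive choice instead of FIFO/DFS order
        pick["visits"] += 1
        total_visits += 1
        print(f"step {step}: explore sub-problem {pick['id']}")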
17:20 CEST TS30.11 TOWARDS COHERENT SEMANTICS: A QUANTITATIVELY TYPED EDSL FOR SYNCHRONOUS SYSTEM DESIGN
Speaker:
Rui Chen, KTH Royal Institute of Technology, SE
Authors:
Rui Chen and Ingo Sander, KTH Royal Institute of Technology, SE
Abstract
We present SynQ, an embedded DSL (EDSL) targeting synchronous system design with quantitative types. SynQ is designed to facilitate semantically coherent system design processes by language embedding and advanced type systems. The current case study indicates the potential for a seamless system design process.
17:21 CEST TS30.12 CO-DESIGN OF SUSTAINABLE EMBEDDED SYSTEMS-ON-CHIP
Speaker:
Dominik Walter, FAU, DE
Authors:
Jan Spieck, Dominik Walter, Jan Waschkeit and Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE
Abstract
This paper introduces a novel approach to the co-design of sustainable embedded systems through multi-objective design space exploration (DSE). We propose a two-phase methodology that optimizes both the multiprocessor system-on-chip (MPSoC) architecture and application mappings, considering sustainability, reliability, performance, and cost as optimization objectives. Our method thereby accounts for both operational and embodied emissions, providing a more comprehensive assessment of sustainability. First, an individual intra-application DSE is performed to explore Pareto-optimal constraint graphs for each application. The second phase, an inter-application DSE, combines these results to explore sustainable target architectures and corresponding application mappings. Our approach incorporates detailed models for embodied emissions (scope 1 and scope 2), operational emissions, reliability, performance, and cost. The evaluation demonstrates that our sustainability-aware DSE is able to explore design spaces, supported by superior results in four key objectives. This enables the development of sustainable embedded systems whilst achieving high performance and reliability.

TS31 Emerging Design Technologies

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS31.1 GENETIC ALGORITHM-DRIVEN IMC MAPPING FOR CNNS USING MIXED QUANTIZATION AND MLC FEFETS
Speaker:
Alptekin Vardar, Fraunhofer IPMS, DE
Authors:
Alptekin Vardar, Franz Müller, Gonzalo Cuñarro Podestá, Nellie Laleni, Nandakishor Yadav and Thomas Kämpfe, Fraunhofer IPMS, DE
Abstract
Ferroelectric Field-Effect Transistors (FeFETs) are emerging as a highly promising non-volatile memory (NVM) technology for in-memory computing architectures, thanks to their low power consumption and non-volatility. These characteristics make FeFETs particularly well-suited for convolutional neural networks (CNNs), especially in power-constrained environments where minimizing the memory footprint is critical for improving both area efficiency and energy consumption. Two effective strategies for reducing memory requirements are quantization and the use of multi-level cell (MLC) configurations in NVMs. This work proposes a solution that combines mixed quantization schemes with FeFET-based MLC and single-level cell (SLC) configurations to balance memory usage and accuracy. Given the large hyperparameter space introduced by these combinations, we employ a genetic algorithm to efficiently explore and identify Pareto-optimal solutions, allowing flexible adaptation to various application-specific requirements. Our approach achieves significant improvements in both memory efficiency and performance, reducing memory usage by 50% while sacrificing only 3% accuracy compared to the 8-bit ResNet baseline. After a single epoch of retraining, the accuracy matches the baseline while fully retaining the memory savings. Additionally, when compared to the 4-bit baseline, a 46% memory reduction is achieved with virtually no loss in accuracy.
16:35 CEST TS31.2 OPENMFDA: MICROFLUIDIC DESIGN AUTOMATION IN THREE DIMENSIONS
Speaker:
Ashton Snelgrove, University of Utah, US
Authors:
Ashton Snelgrove1, Daniel Wakeham1, Skylar Stockham1, Scott Temple2 and Pierre-Emmanuel Gaillardon1
1University of Utah, US; 2Primis AI, US
Abstract
Current microfluidic design automation (MFDA) solutions are limited by the planarity requirements of current manufacturing techniques. Recent advances in stereolithography 3D printing create an opportunity for new MFDA design methodologies. We propose a methodology for the placement of microfluidic components and the routing of flow and control channels in three dimensions. Additionally, we propose a methodology for generating a printable 3D structure from the layout. We then present OpenMFDA, an open-source MFDA design flow implementing the proposed methodologies. This design flow takes a structural netlist and produces a sliced design for manufacturing using an SLA 3D printer. Our methodology demonstrates short run times and generates devices with 2-20× smaller area compared to state-of-the-art MFDA tools.
16:40 CEST TS31.3 CLAIRE: COMPOSABLE CHIPLET LIBRARIES FOR AI INFERENCE
Speaker:
Pragnya Nalla, University of Minnesota Twin Cities, US
Authors:
Pragnya Nalla1, Emad Haque2, Yaotian Liu2, Sachin S. Sapatnekar1, Jeff Zhang2, Chaitali Chakrabarti2 and Yu Cao1
1University of Minnesota, US; 2Arizona State University, US
Abstract
Artificial intelligence has made a significant impact on fields like computer vision, natural language processing (NLP), healthcare, and robotics. However, recent AI models, such as GPT-4 and LLaMAv3, demand a significant amount of computational resources, pushing monolithic chips to their technological and practical limits. 2.5D chiplet-based heterogeneous architectures have been proposed to address these limits. While chiplet optimization for models like Convolutional Neural Networks (CNNs) is well established, scaling this approach to accommodate diverse AI inference models with different computing primitives, data volumes, and chiplet sizes is very challenging. A set of hardened IPs and chiplet libraries optimized for a broad range of AI applications is proposed in this work. We derive the set of chiplet configurations that are composable, scalable, and reusable by employing an analytical framework trained on a diverse set of AI algorithms. Testing this set of library-synthesized configurations on a different set of algorithms, we achieve a 1.99×-3.99× improvement in non-recurring engineering (NRE) chiplet design costs, with minimal performance overhead compared to custom chiplet-based ASIC designs. Similar to soft IPs for SoC development, the library of chiplets improves flexibility, reusability, and efficiency for AI hardware designs.
16:45 CEST TS31.4 A TALE OF TWO SIDES OF WAFER: PHYSICAL IMPLEMENTATION AND BLOCK-LEVEL PPA ON FLIP FET WITH DUAL-SIDED SIGNALS
Speaker:
Haoran Lu, Peking University, CN
Authors:
Haoran Lu, Xun Jiang, Yanbang Chu, Ziqiao Xu, Rui Guo, Wanyue Peng, Yibo Lin, Runsheng Wang, Heng Wu and Ru Huang, Peking University, CN
Abstract
As the conventional scaling of logic devices comes to an end, a functional wafer backside and 3D transistor stacking are consensus directions for next-generation logic technology, offering considerable design-space extension for power, signals, or even devices on the wafer backside. The Flip FET (FFET), a novel transistor architecture combining 3D transistor stacking and a fully functional wafer backside, was recently proposed. With a symmetric dual-sided standard cell design, the FFET can deliver around 12.5% cell area scaling and faster but more energy-efficient libraries beyond other stacked transistor technologies such as the Complementary FET (CFET). In addition, thanks to the novel cell design with dual-sided pins, the FFET supports dual-sided signal routing, delivering better routability and a larger backside design space. In this work, we demonstrate a comprehensive FFET evaluation framework considering physical implementation and block-level power-performance-area (PPA) assessment for the first time, in which the key functions are dual-sided routing and dual-sided RC extraction. A 32-bit RISC-V core was used for the evaluation. Compared to the CFET with single-sided signals, the FFET with single-sided signals (for fair comparison) achieved 23.3% post-P&R core area reduction, 25.0% higher frequency and 11.9% lower power at the same utilization, and 16.0% higher frequency at the same core area. Meanwhile, the FFET supports dual-sided signals, which can further benefit from the flexible allocation of cell input pins on both sides. By optimizing the input pin density and the number of BEOL routing layers on each side, a 10.6% frequency gain was realized without power degradation compared to the design with single-sided signal routing. Moreover, the routability and power efficiency of the FFET barely degrade even with the number of routing layers reduced from 12 to 5 on each side, validating the great space for cost-friendly design enabled by the FFET.
16:50 CEST TS31.5 COLUMN-WISE QUANTIZATION OF WEIGHTS AND PARTIAL SUMS FOR ACCURATE AND EFFICIENT COMPUTE-IN-MEMORY ACCELERATORS
Speaker:
Kang Eun Jeon, Sungkyunkwan University, KR
Authors:
Jiyoon Kim, Kang Eun Jeon, Yulhwa Kim and Jong Hwan Ko, Sungkyunkwan University, KR
Abstract
Compute-in-memory (CIM) is an efficient method for implementing deep neural networks (DNNs) but suffers from substantial overhead from analog-to-digital converters (ADCs), especially as ADC precision increases. Low-precision ADCs can reduce this overhead but introduce partial-sum quantization errors degrading accuracy. Additionally, low-bit weight constraints, imposed by cell limitations and the need for multiple cells for higher-bit weights, present further challenges. While fine-grained partial-sum quantization has been studied to lower ADC resolution effectively, weight granularity, which limits overall partial-sum quantized accuracy, remains underexplored. This work addresses these challenges by aligning weight and partial-sum quantization granularities at the column-wise level. Our method improves accuracy while maintaining dequantization overhead, simplifies training by removing two-stage processes, and ensures robustness to memory cell variations via independent column-wise scale factors. We also propose an open-source CIM-oriented convolution framework to handle fine-grained weights and partial-sums efficiently, incorporating a novel tiling method and group convolution. Experimental results on ResNet-20 (CIFAR-10, CIFAR-100) and ResNet-18 (ImageNet) show accuracy improvements of 0.99%, 2.69%, and 1.01%, respectively, compared to the best-performing related works. Additionally, variation analysis reveals the robustness of our method against memory cell variations. These findings highlight the effectiveness of our quantization scheme in enhancing accuracy and robustness while maintaining hardware efficiency in CIM-based DNN implementations. Our code is available at https://github.com/jiyoonkm/ColumnQuant.
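A minimal sketch of column-wise weight quantization, assuming a symmetric scheme with one scale factor per crossbar column (numpy; the bit-width and the synthetic weight matrix are illustrative):

    import numpy as np

    def quantize_columnwise(W, bits=4):
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(W).max(axis=0) / qmax                    # one scale factor per column
        q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    rng = np.random.default_rng(0)
    W = rng.normal(scale=[0.05, 0.5, 2.0], size=(64, 3))        # columns with very different ranges
    q, scale = quantize_columnwise(W)
    print("per-column scales:", scale)
    print("max abs error    :", np.abs(W - q * scale).max(axis=0))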
16:55 CEST TS31.6 DHD: DOUBLE HARD DECISION DECODING SCHEME FOR NAND FLASH MEMORY
Speaker:
Lanlan Cui, Xi'an University of Technology, CN
Authors:
Lanlan Cui1, Yichuan Wang1, Renzhi Xiao2, Miao Li3, Xiaoxue Liu1 and Xinhong Hei1
1Xi'an University of Technology, CN; 2Jiangxi University of Science and Technology, CN; 3National University of Defense Technology, CN
Abstract
With the advancement of NAND flash technology, increased storage density leads to intensified interference, which in turn raises the error rate during data retrieval. To ensure data reliability, low-density parity-check (LDPC) codes are extensively employed for error correction in NAND flash memory. Although LDPC soft-decision decoding offers high error correction capability, it comes with significant latency. Conversely, hard-decision decoding, although faster, lacks sufficient error correction strength. Consequently, flash memory typically starts with hard-decision decoding and resorts to multiple rounds of soft-decision decoding upon failure. To minimize decoding latency, this paper proposes a decoding mechanism based on a double hard decision, called DHD. The DHD scheme improves the Log-Likelihood Ratio (LLR) used in the hard-decision process. After the first hard decision fails, the read reference voltage (RRV) is adjusted to perform a second hard-decision decoding. If the second hard decision also fails, soft-decision decoding is then employed. Experimental results demonstrate that when the Raw Bit Error Rate (RBER) is 8.5E-3, DHD reduces the Frame Error Rate (FER) by 86.4% compared to the traditional method.
17:00 CEST TS31.7 WRITE-OPTIMIZED PERSISTENT HASH INDEX FOR NON-VOLATILE MEMORY
Speaker:
Renzhi Xiao, Jiangxi University of Science and Technology, CN
Authors:
Renzhi Xiao1, Dan Feng2, Yuchong Hu2, Yucheng Zhang2, Lanlan Cui3 and Lin Wang2
1Jiangxi University of Science and Technology, CN; 2Huazhong University of Science and Technology, CN; 3Xi'an University of Technology, CN
Abstract
A hashing index provides rapid search performance by swiftly locating key-value items. Non-volatile memory (NVM) technologies have driven research into hashing indexes for NVM, combining hard-disk persistence with DRAM-level performance. Nevertheless, current NVM-based hashing indexes must tackle data inconsistency challenges caused by NVM write reordering or partial writes, and mitigate rapid local wear due to frequent updates, considering NVM's limited endurance. The temporary allocation of buckets in NVM-based chained hashing to resolve hash collisions prolongs the critical path for writing, thus hampering write performance. This paper presents WOPHI, a write-optimized persistent hash index scheme for NVM. By utilizing log-free failure-atomic writes, WOPHI minimizes data consistency overhead and addresses hash conflicts with bucket pre-allocation. Experimental results underscore WOPHI's significant performance enhancements, with insertion latency reduced by up to 88.2% and deletion latency reduced by up to 82.6% compared to existing state-of-the-art schemes. Moreover, WOPHI substantially mitigates data consistency overhead, reducing cache-line flushes by 59.3%, while maintaining robust write throughput for insert and delete operations.
17:05 CEST TS31.8 DEAR-PIM: PROCESSING-IN-MEMORY ARCHITECTURE WITH DISAGGREGATED EXECUTION OF ALL-BANK REQUESTS
Speaker:
Jungi Hyun, Seoul National University, KR
Authors:
Jungi Hyun, Minseok Seo, Seongho Jeong, Hyuk-Jae Lee and Xuan Truong Nguyen, Seoul National University, KR
Abstract
Emerging transformer-based large language models (LLMs) involve many low-arithmetic-intensity operations, which result in sub-optimal performance on general-purpose CPUs and GPUs. Processing-in-Memory (PIM) has shown promise in enhancing performance by reducing data movement bottlenecks. Commodity near-bank PIMs enable in-memory computation through bank-level compute units and typically rely on all-bank commands, which simultaneously operate the compute units of all banks to maximize internal bandwidth and parallelism. However, activating all banks simultaneously before issuing all-bank commands generally requires high peak power, which may exceed the system power limit when stacking multiple PIM devices for LLM inference. Additionally, under a DRAM power constraint, all-bank commands are only issued after all banks are fully activated through a sequence of single-bank activations, incurring bubble cycles and degrading overall performance. To address these shortcomings, this study proposes DEAR-PIM, a novel PIM architecture with Disaggregated Execution of All-bank Requests. DEAR-PIM incorporates a disaggregated command queue, allowing it to buffer all-bank commands and provide them to each bank sequentially without waiting for all-bank activations to complete. However, since all banks must finish their disaggregated execution before simultaneous post-processing, synchronization between early-activated and last-activated banks is necessary. To tackle this issue, DEAR-PIM introduces a column-aware synchronization command scheme that inserts no-op-like commands into unused columns without modifying the memory controller. Experiments demonstrate that DEAR-PIM achieves a speedup of 2.03-3.33× over an A100 GPU and improves performance by 1.11-1.52× compared to the sequential activation scheme. DEAR-PIM also reduces peak power consumption by 21.3-41.7% compared to the simultaneous activation scheme.
17:10 CEST TS31.9 SYNDCIM: A PERFORMANCE-AWARE DIGITAL COMPUTING-IN-MEMORY COMPILER WITH MULTI-SPEC-ORIENTED SUBCIRCUIT SYNTHESIS
Speaker:
Kunming Shao, The Hong Kong University of Science and Technology, HK
Authors:
Kunming Shao1, Fengshi Tian1, Xiaomeng Wang1, Jiakun Zheng1, Jia Chen2, Jingyu He1, Hui Wu3, Jinbo Chen3, Xihao Guan1, Yi Deng2, Fengbin Tu1, Jie Yang3, Mohamad Sawan3, Tim Cheng1 and Chi Ying Tsui1
1The Hong Kong University of Science and Technology, HK; 2AI Chip Center for Emerging Smart Systems (ACCESS),Hong Kong University of Science and Technology, HK; 3Westlake University, CN
Abstract
Digital Computing-in-Memory (DCIM) is an innovative technology that integrates multiply-accumulation (MAC) logic directly into memory arrays to enhance the performance of modern AI computing. However, the need for customized memory cells and logic components currently necessitates significant manual effort in DCIM design. Existing tools for facilitating DCIM macro designs struggle to optimize subcircuit synthesis to meet user-defined performance criteria, thereby limiting the potential system-level acceleration that DCIM can offer. To address these challenges and enable the agile design of DCIM macros with optimal architectures, we present SynDCIM — a performance-aware DCIM compiler that employs multi-spec-oriented subcircuit synthesis. SynDCIM features an automated performance-to-layout generation process that aligns with user-defined performance expectations. This is supported by a scalable subcircuit library and a multi-spec-oriented searching algorithm for effective subcircuit synthesis. The effectiveness of SynDCIM is demonstrated through extensive experiments and validated with a test chip fabricated in a 40nm CMOS process. Testing results reveal that designs generated by SynDCIM exhibit competitive performance when compared to state-of-the-art manually designed DCIM macros.

US02 Unplugged session

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST


CC Closing Ceremony

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 18:00 CEST - 18:30 CEST