DATE 2025 Detailed Programme

The detailed programme of DATE 2025 will be updated continuously.

More information is available on the ASD Initiative, Keynotes, Tutorials, Workshops, and Young People Programme.

Navigate to Monday, 31 March 2025 | Tuesday, 01 April 2025 | Wednesday, 02 April 2025.


Monday, 31 March 2025

OC Opening Ceremony

Date: Monday, 31 March 2025
Time: 08:30 CEST - 09:00 CEST


OK01 Opening Keynote 1

Date: Monday, 31 March 2025
Time: 09:00 CEST - 09:45 CEST

Time | Label | Presentation Title | Authors
09:00 CEST OK01.1 TOWARDS GREENER ELECTRONICS AND A 1000X GAIN IN ENERGY EFFICIENCY: CO-OPTIMIZING INNOVATIVE IC ARCHITECTURES, DISRUPTIVE CMOS TECHNOLOGIES AND NEW EDA TOOLS
Presenter:
Jean-René Lèquepeys, CEA-Leti, FR
Author:
Jean-René Lèquepeys, CEA-Leti, FR
Abstract
Semiconductors and chips are ever-present in our current digital world. From smart sensors and the industrial Internet of Things to Digital Cities, personalized Medicine, Precision Agriculture, Vehicle Automation, and Cloud & High Performance Computing, semiconductor applications cover a very wide spectrum of society's needs. However, global warming is highlighting the social and environmental impact of the digital transition, and the complex trade-offs and choices that lie ahead if we are to build a sustainable world. How do we pursue digitalization within a limited power budget and planetary limits? How do we make greener choices in the face of ever-increasing and aggressive competition? How do we choose the right digital performance for each application instead of a one-size-fits-all, best-performance-for-all approach? The semiconductor ecosystem is indeed facing a difficult dilemma with complex key trade-offs. With these stakes clearly in mind, the semiconductor community is performing disruptive research to provide greener electronics, able to attain very large gains in energy efficiency and just the right performance for each application. With the help of AI-boosted design methodologies and CAD tools, we have set out to co-optimize innovative CMOS technologies, disruptive chip architectures, and computing models with new algorithms for embedded software. This keynote will provide an overview of the global semiconductor landscape and the challenge of mastering the data deluge for the entire semiconductor ecosystem. In order to face this challenge, we must all work together to reduce the collection, transport and storage of fruitless data. The keynote will also describe recent results from CEA-Leti and CEA-List's research on sustainable and greener technologies. To conclude, I will present an overview of the European Chips Act initiative, with the launch of the pilot lines, the Design Platforms and Competence Centers, a pan-European program that will be driving key milestones in the next five years to accelerate the accomplishment of our common goal of a sustainable and sovereign digital Europe.

OK02 Opening Keynote 2

Date: Monday, 31 March 2025
Time: 09:45 CEST - 10:30 CEST

Time | Label | Presentation Title | Authors
09:45 CEST OK02.1 A VISION OF SYSTEMS AND TECHNOLOGY IN A CONNECTED EUROPE
Presenter:
Giovanni De Micheli, EPFL, CH
Author:
Giovanni De Micheli, EPFL, CH
Abstract
The unprecedented growth of electronic system applications, from AI to smart products, creates both a huge market opportunity and a deep need for talented engineers. Europe will play a dominant role in the 2030s if we (i.e., our community) can set up the premises for such a technology expansion now. Whereas the European Chips Act is an important enabler, finance represents only one of the necessary conditions for success. The key aspect is the ability to leverage diverse competences and connect the partially-untapped energies of the various European players, ranging from Industry to Academia. Europe's strength stems from diversity and the ability to design complex systems from parts, possibly coming from various sources. The 'value added' comes from the engineers who can create functionality and services, and who can adapt them to a diverse market of consumers. Yet I argue that this precious resource, the human capital represented by engineers and technologists, is too scarce and its limitation in size is a main handicap for creating a strong market of intelligent products and services. Education of engineers has to evolve and concentrate on the broader issue of system problem solving based on a deep understanding of technology. Industry has to join forces with academia by sharing knowledge and objectives and by creating a strong enthusiasm for engineering.

ASD01 ASD technical session: Enhancing Dependability and Efficiency in Automotive and Autonomous Systems

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Session chair:
Selma Saidi, TU Braunschweig, DE

Session co-chair:
Dirk Ziegenbein, Robert Bosch GmbH, DE

This session explores advancements in automotive and autonomous systems, focusing on achieving predictability, reliability, and efficiency. The session begins with a proposal on extending the AUTOSAR Adaptive standard using the System-Level Logical Execution Time (SL-LET) paradigm to ensure determinism, critical for the predictability of modern automotive systems. The second presentation demonstrates noise perturbation attacks on image segmentation, a core perception component of safety-critical autonomous systems, and how they can be predicted and mitigated. Finally, a framework designed to optimize the efficiency of 3D object detection in autonomous vehicles through pattern pruning and quantization is presented, significantly enhancing real-time performance and energy efficiency on resource-limited platforms.

Time | Label | Presentation Title | Authors
11:00 CEST ASD01.1 MODELING THE SL-LET PARADIGM IN AUTOSAR ADAPTIVE
Speaker:
Davide Bellassai, Scuola Superiore Sant'Anna, IT
Authors:
Davide Bellassai1, Gerlando Sciangula2, Claudio Scordino3, Daniel Casini4 and Alessandro Biondi4
1Evidence S.r.l., Scuola Superiore Sant'Anna, IT; 2Huawei and Scuola Superiore Sant'Anna, IT; 3Huawei Inc, IT; 4Scuola Superiore Sant'Anna, IT
Abstract
The AUTOSAR consortium has proposed the AUTOSAR Adaptive standard to tackle the challenges introduced by the design of modern automotive functionality. It consists of a service-oriented architecture (SoA) implemented in C++ and built on top of POSIX operating systems. However, unlike the previous AUTOSAR Classic specifications, this novel standard does not address non-functional requirements, including determinism, which is of key importance to guarantee the system's functional safety. This paper proposes extensions to the AUTOSAR Adaptive standard to achieve determinism by leveraging the System-Level Logical Execution Time (SL-LET) paradigm, which is already used in the context of AUTOSAR Classic but needs to be revisited to be employed in Adaptive. We evaluate the feasibility of the proposed model extension on the AUTOSAR Adaptive Platform Demonstrator (APD), which provides an implementation of AUTOSAR Adaptive specifications using a realistic automotive application.
11:30 CEST ASD01.2 GENERATING AND PREDICTING OUTPUT PERTURBATIONS IN IMAGE SEGMENTERS
Speaker:
Bryan Donyanavard, San Diego State University, US
Authors:
Matthew Bozoukov1, Nguyen Anh Vu Doan2 and Bryan Donyanavard3
1Miramar College, US; 2Infineon Technologies AG & TU Munich, DE; 3San Diego State University, US
Abstract
Image segmentation applications are a core component of safety-critical autonomous software pipelines. Sensor data input noise can lead to segmentation output corruption that threatens safety in both DNN- and transformer-based segmenters. Previous work has proposed methods for generating malicious noise to cause DNN- and transformer-based object detection and classification output corruption. We perform the same task for image segmentation applications using genetic algorithms for optimization. We then propose a novel method to predict whether an input image will yield a corrupted segmentation output due to noise. We evaluate the optimal noise generation and corruption prediction on state-of-the-art image segmenters YOLOv8 and DETR. We observe that we can (a) cause segmentation output corruption with noise that is undetectable to the human eye and unrelated to the corrupted region of the image; and (b) predict output corruption due to image noise with over 96% accuracy.
12:00 CEST ASD01.3 UPAQ: A FRAMEWORK FOR REAL-TIME AND ENERGY-EFFICIENT 3D OBJECT DETECTION IN AUTONOMOUS VEHICLES
Speaker:
Abhishek Balasubramaniam, Colorado State University, US
Authors:
Abhishek Balasubramaniam1, Febin Sunny2 and Sudeep Pasricha3
1Colorado State University, US; 2AMD, US; 3Colorado State University, US
Abstract
To enhance perception in autonomous vehicles (AVs), recent efforts are concentrating on 3D object detectors, which deliver more comprehensive predictions than traditional 2D object detectors, at the cost of increased memory footprint and computational resource usage. We present a novel framework called UPAQ, which leverages semi-structured pattern pruning and quantization to improve the efficiency of LiDAR point-cloud and camera-based 3D object detectors on resource-constrained embedded AV platforms. Experimental results on the Jetson Orin Nano embedded platform indicate that UPAQ achieves up to 5.62× and 5.13× model compression rates, up to 1.97× and 1.86× boost in inference speed, and up to 2.07× and 1.87× reduction in energy consumption compared to state-of-the-art model compression frameworks, on the Pointpillar and SMOKE models respectively.

BPA01 BPA Session 1

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Time | Label | Presentation Title | Authors
11:00 CEST BPA01.1 QUANTIFYING TRADE-OFFS IN POWER, PERFORMANCE, AREA, AND TOTAL CARBON FOOTPRINT OF FUTURE THREE-DIMENSIONAL INTEGRATED COMPUTING SYSTEMS
Speaker:
Danielle Grey-Stewart, Harvard University, US
Authors:
Danielle Grey-Stewart, Mariam Elgamal, David Kong, Georgios Kyriazidis, Jalil Morris and Gage Hills, Harvard University, US
Abstract
To address computing's carbon footprint challenge, designers of computing systems are beginning to consider carbon footprint as a first-class figure of merit, alongside conventional metrics such as power, performance, and area. To account for total carbon (tC) footprint of a computing system, carbon footprint models must consider both embodied carbon (Cembodied) due to emissions during manufacturing, and operational carbon (Coperational) from day-to-day use. Models for Coperational are relatively mature due to the direct relationship between Coperational and energy consumed while computing. In contrast, models for Cembodied primarily focus on today's silicon-based technologies, not capturing the wide range of beyond-Si technologies that are actively being developed for future computing systems, including emerging nanomaterials, emerging memory devices, and various three-dimensional (3D) integration techniques. Cembodied models for emerging technologies are essential for accurately predicting which technology directions to pursue without exacerbating computing's carbon footprint. In this paper, we (1) develop Cembodied models for 3D-integrated computing systems that leverage emerging nanotechnologies. We analyze an example fabrication process that is highly promising for energy-efficient computing: 3D integration of carbon nanotube field-effect transistors (CNFETs) and indium gallium zinc oxide (IGZO) FETs fabricated directly on top of Si CMOS at a 7 nm technology node. We show that Cembodied of this process is, on average (considering various energy grids), 1.31× higher per wafer vs. a baseline 7 nm node Si CMOS process. (2) As a case study, we quantify trade-offs in power, performance, area, and tC footprint for an embedded system comprising an ARM Cortex-M0 processor and embedded DRAM, implemented in each of the above processes. For a representative lifetime of the system (running applications from the Embench suite for 2 hours per day over 24 months, with a clock frequency of 500 MHz), we show that the 3D IGZO/CNFET/Si implementation is 1.02× more carbon-efficient per good die (considering yield) vs. the baseline Si implementation, quantified by the product of tC and application execution time (tCDP, an effective metric of carbon efficiency). (3) Finally, we show techniques to quantify carbon efficiency benefits of future computing systems, even when there is uncertainty in carbon footprint models. Specifically, we show how to robustly compare tCDP for multiple computing systems, given underlying uncertainty in Cembodied, computing system lifetime, carbon intensity (in equivalent grams of CO2 emissions per unit energy consumption), and yield.
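For reference, the two metrics named in this abstract can be written compactly; the relations below only restate the definitions given above (total carbon as the sum of embodied and operational carbon, and the carbon-delay product as the carbon-efficiency figure of merit) and are not additional results from the paper:

  tC = Cembodied + Coperational        (total carbon: manufacturing plus use-phase emissions)
  tCDP = tC × t_execution              (carbon-delay product; lower means more carbon-efficient)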
11:20 CEST BPA01.2 COMPUTE-IN-MEMORY ARRAY DESIGN USING STACKED HYBRID IGZO/SI EDRAM CELLS
Speaker:
Munhyeon Kim, Seoul National University, KR
Authors:
Munhyeon Kim1, Yulhwa Kim2 and Jae-Joon Kim1
1Seoul National University, KR; 2Sungkyunkwan University, KR
Abstract
To effectively accelerate neural networks in compute-in-memory (CIM) systems, higher memory cell density is critical for managing increasing computational workloads and parameters. While CMOS-based embedded dynamic random access memory (eDRAM) is being explored as an alternative, addressing the short retention time (tret) (<1 ms) remains a challenge for system applications. Recent studies highlight that InGaZnO (IGZO)-based eDRAM achieves a significantly longer retention time (>100 s), but additional improvements are needed due to considerable cell variability and slower operating speeds compared to CMOS-based cells. This paper proposes a 3T-based stacked hybrid IGZO/Si eDRAM (Hybrid-3T) cell and array design for CIM systems, alongside a system-level evaluation for deep neural network (DNN) workloads. The Hybrid-3T cell, built on 7-nm FinFET technology, extends the retention time by 100 s compared to IGZO-based 3T eDRAM (IGZO-3T). It also provides 3.4× higher bit cell density compared to 8T SRAM cells and 2× higher density than CMOS-based 3T eDRAM (CMOS-3T), while maintaining similar throughput and variability levels as eDRAM and SRAM systems. Additionally, DNN inference accuracy for vision and natural language processing (NLP) tasks is evaluated using the proposed CIM design, considering the impact of enhanced cell variability and retention time on system-level performance. The retention time required for CIM operation accuracy (tret,CIM) is more than 10^7 times longer in Hybrid-3T than in CMOS-3T, and the retention time accounting for variability (tret,CIM v) is over 3× longer than IGZO-3T eDRAM. Consequently, the proposed Hybrid-3T eDRAM CIM integrates the strengths of both CMOS-3T and IGZO-3T CIM designs, enabling high-performance, reliable systems.
11:40 CEST BPA01.3 TIMING-DRIVEN GLOBAL PLACEMENT BY EFFICIENT CRITICAL PATH EXTRACTION
Speaker:
Yunqi Shi, Nanjing University, CN
Authors:
Yunqi Shi1, Siyuan Xu2, Shixiong Kai2, Xi Lin1, Ke Xue1, Mingxuan Yuan3 and Chao Qian1
1Nanjing University, CN; 2Huawei Noah's Ark Lab, CN; 3Huawei Noah's Ark Lab, HK
Abstract
Timing optimization during the global placement of integrated circuits has been a significant focus for decades, yet it remains a complex, unresolved issue. Recent analytical methods typically use pin-level timing information to adjust net weights, which is fast and simple but neglects the path-based nature of the timing graph. The existing path-based methods, however, cannot balance the accuracy and efficiency due to the exponential growth of number of critical paths. In this work, we propose a GPU-accelerated timing-driven global placement framework, integrating accurate path-level information into the efficient DREAMPlace infrastructure. It optimizes the fine-grained pin-to-pin attraction objective and is facilitated by efficient critical path extraction. We also design a quadratic distance loss function specifically to align with the RC timing model. Experimental results demonstrate that our method significantly outperforms the current leading timing-driven placers, achieving an average improvement of 40.5% in total negative slack (TNS) and 8.3% in worst negative slack (WNS), as well as an improvement in half-perimeter wirelength (HPWL).

CFP Panel on Career Perspectives

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:00 CEST


ET01 Agile Hardware Specialization: A toolbox for Agile Chip Front-end Design

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Compared to software design, hardware design is more expensive and time-consuming. This is partly because the software community has developed a rich set of modern tools that help software programmers get projects started and iterated easily and quickly. For hardware design, by contrast, the tools are seriously antiquated and lacking. Modern digital chips are still designed manually using hardware description languages such as Verilog or VHDL, which requires low-level and tedious programming, debugging, and tuning. In this tutorial, we will introduce Agile Hardware Specialization (AHS), a toolbox for agile chip front-end design.

The tutorial will highlight the methodology and open-source tools in AHS for both chip design and verification. From the design perspective, AHS offers multiple approaches that use different programming interfaces and target different scenarios, including:

  1. a multi-level hardware intermediate representation-based high-level synthesis flow, which uses C and C++ as the programming languages. This flow also supports domain-specific languages and optimizations for specific domains such as tensor algebra. We also design an efficient cross-level debugger for high-level synthesis that enables breakpoints and stepping at different hardware intermediate representations.
  2. an embedded hardware description language, which uses Rust as the programming language. This flow includes a general HDL and provides deterministic timing support and procedural control logic specification.

These different methodologies exhibit different trade-offs in productivity and PPA (performance, power, and area) for chip design. From the verification perspective, we will present agile simulation and debugging tools, which can check the functional and performance behaviors of the hardware. The attendees will learn the methodology, design automation fundamentals, and software tools of AHS.

Speakers

Dr. Yun Liang, Professor, Peking University, China

Xiaochen Hao, Ph.D. Candidate, Peking University, China

Target Audience

We invite DATE 2025 participants with a keen interest in chip design and verification and computer-aided design (CAD) tools. Please join us!

Learning objectives

  • An introduction to the AHS framework.
  • Details of the AHS tools including Hector, Hestia, Cement, Khronos, etc.
  • Hands-on experimentation using AHS tools.
  • Motivation for future research within the AHS framework.

Required Background

  • Basic knowledge of a programming language such as C or C++
  • A keen interest in learning hardware specialization and Electronic Design Automation (EDA)
  • Desirable: prior knowledge of high-level synthesis and associated toolchains.

Detailed Program

  • Part 1: Lecture (1 hour)
  • Part 2: Hands-on session (30 mins)

Lab installation instructions and handouts are available at: https://ericlyun.me/tutorial-date2025


FS01 Focus session - Specifications Mining in a World of Generative AI: Extensions, Applications, and Pitfalls

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Session chair:
Graziano Pravadelli, Università di Verona, IT

Session co-chair:
Badri Gopalan, Synopsys, US

Organisers:
Graziano Pravadelli, Università di Verona, IT
Samuele Germiniani, Marconi University of Rome, IT

The session consists of three technical contributions (15 mins each) and one panel (45 mins), totaling 90 minutes, focused on R&D challenges, emerging trends, and solutions for the automatic generation of formal specifications in system-level assertion-based verification (ABV). The first part of the session will explore the role of LLMs in assertion generation, delve into the automatic mining of assertions for security verification, and present a framework for the fair qualification and evaluation of current and future assertion miners. Then, in the second part, the panel will highlight unmet needs in specification mining, motivating researchers to develop new approaches and tools that move beyond academic proofs of concept and position automatic assertion generation as a practical, industry-ready solution for ABV.

Time | Label | Presentation Title | Authors
11:00 CEST FS01.1 ARE LLMS READY FOR PRACTICAL ADOPTION FOR ASSERTION GENERATION?
Speaker:
Debjit Pal, University of Illinois Chicago, US
Authors:
Vaishnavi Pulavarthi1, Deeksha Nandal2 and Debjit Pal2
1UIC, US; 2University of Illinois at Chicago, US
Abstract
Assertions have been the de facto collateral for simulation-based and formal verification of hardware designs for over a decade. The quality of hardware verification, i.e., detection and diagnosis of corner-case design bugs, is critically dependent on the quality of the assertions. With the onset of generative AI such as Transformers and Large Language Models (LLMs), there has been a renewed interest in developing novel, effective, and scalable techniques for generating functional and security assertions from design source code. While there have been recent works that use commercial off-the-shelf (COTS) LLMs for assertion generation, there is no comprehensive study quantifying the effectiveness of LLMs in generating syntactically and semantically correct assertions. In this paper, we first discuss AssertionBench from our prior work, a comprehensive set of designs and assertions to quantify the goodness of a broad spectrum of COTS LLMs for the task of assertion generation from hardware design source code. Our key insight was that COTS LLMs are not yet ready for prime-time adoption for assertion generation as they generate a considerable fraction of syntactically and semantically incorrect assertions. Motivated by this insight, we propose AssertionLLM, a first-of-its-kind LLM, specifically fine-tuned for assertion generation. Our initial experimental results show that AssertionLLM considerably improves the semantic and syntactic correctness of the generated assertions over COTS LLMs.
11:22 CEST FS01.2 SECURITY ASSERTIONS FOR TRUSTED EXECUTION ENVIRONMENTS
Speaker:
Prabhat Mishra, University of Florida, US
Authors:
Hasini Witharana, Hansika Weerasena and Prabhat Mishra, University of Florida, US
Abstract
Trusted Execution Environment (TEE) provides a secure and isolated execution environment for sensitive applications. In order to design secure and trustworthy TEE-based systems, it is crucial to verify the trustworthiness of TEE implementations. Property checking is a promising avenue to guarantee that the TEE implementation satisfies the security properties. In the presence of a vulnerability, property checking will fail and provide a counterexample that can be utilized to fix the vulnerability. A major challenge in TEE property checking is that it relies on manual definition of the security properties, which can be cumbersome and error-prone. In this paper, we propose an efficient framework for automated generation and verification of TEE specific properties. Specifically, we leverage Finite State Machine (FSM) analysis to automatically derive and validate security properties utilizing templates. The effectiveness of the proposed method is demonstrated through experimental evaluation of Intel Trust Domain Extension (TDX), highlighting its potential for verifying security and trustworthiness of modern trusted execution environments.
11:45 CEST FS01.3 A BASELINE FRAMEWORK FOR THE QUALIFICATION OF SPECIFICATIONS MINERS
Speaker:
Samuele Germiniani, University of Guglielmo Marconi and University of Verona, IT
Authors:
Samuele Germiniani, Daniele Nicoletti and Graziano Pravadelli, Università di Verona, IT
Abstract
Over the past few decades, the verification community has developed several specification miners as an alternative to manual assertion definition. However, assessing their effectiveness remains a challenging task. Most studies evaluate these miners using predefined ranking metrics, which often fail to ensure the quality of the inferred specifications, especially when no fixed ground truth exists and the relevance of the specifications varies depending on the use case. This paper presents a comprehensive framework aimed at facilitating the evaluation and comparison of LTL specification miners. Unlike traditional approaches, which struggle with subjective analyses and complex tool configurations, our framework provides a structured method for assessing and comparing the quality of specifications generated by multiple sources, using both semantic and syntactic techniques. To achieve this, the framework offers users an easy-to-extend environment for installing, configuring, and running third-party miners via Docker containers. Additionally, it supports the inclusion of new evaluation methods through a modular design. Miner comparison can be based either on user-defined designs or on synthetic benchmarks, which are automatically generated to serve as a non-subjective ground truth for the evaluation of the miners. We demonstrate the utility of our framework through comparative analyses with four well-known LTL miners, illustrating its ability to standardize and enhance the specification mining evaluation process.
12:07 CEST FS01.4 SPECIFICATION MINING FACING GENERATIVE AI
Speaker:
Goerschwin Fey, TU Hamburg, DE
Authors:
Goerschwin Fey1, Harry Foster2, Tara Ghasempouri3, Badri Gopalan4, Joerg Mueller5 and Manish Pandey4
1TU Hamburg, DE; 2Siemens/Mentor Graphics, US; 3Department of Computer Systems, Tallinn University of Technology, EE; 4Synopsys, US; 5Formal Verification Expert, DE
Abstract
Specifications for complex designs and their consistency are always a headache. Automated specification mining – including but not limited to generative AI – offers attractive solutions, but there are also various unmet needs.

LKS01 Later … with the keynote speakers

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:00 CEST


TS01 Emerging design technologies for future computing

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Time | Label | Presentation Title | Authors
11:00 CEST TS01.1 OPTIMAL SYNTHESIS OF MEMRISTIVE MIXED-MODE CIRCUITS
Speaker:
Ilia Polian, University of Stuttgart, DE
Authors:
Ilia Polian1, Xianyue Zhao2, Li-Wei Chen1, Felix Bayhurst1, Ziang Chen2, Heidemarie Schmidt2 and Nan Du2
1University of Stuttgart, DE; 2University of Jena and Leibniz Institute of Photonic Technology, Jena, Germany, DE
Abstract
Memristive crossbars are attractive for in-memory computing due to their integration density combined with compute and storage capabilities of their basic devices. However, yield and fidelity of emerging memristive technologies can make their reliable operation unattainable, thus raising interest in simpler topologies. In this paper, we consider synthesis of Boolean functions on 1D memristive line arrays. We propose an optimal procedure that can fully utilize the rich electrical behavior of memristive devices, mixing stateful (resistance-input) and nonstateful (voltage-input) operations as desired by the designer, leveraging their respective strengths. The synthesis method is based on Boolean satisfiability (SAT) solving and supports flexible constraints to enforce, e.g., restrictions of the available peripherals. We experimentally validate memristive logic circuits beyond individual logic gates by demonstrating the operation of a Galois field multiplier using a 1D line array of 10 memristors in parallel, highlighting the robust performance of our proposed mixed-mode circuit and its synthesis procedure.
11:05 CEST TS01.2 NVCIM-PT: AN NVCIM-ASSISTED PROMPT TUNING FRAMEWORK FOR EDGE LLMS
Speaker:
Ruiyang Qin, University of Notre Dame, US
Authors:
Ruiyang Qin1, Pengyu Ren1, Zheyu Yan2, Liu Liu1, Dancheng Liu3, Amir Nassereldine3, Jinjun Xiong3, Kai Ni1, X. Sharon Hu1 and Yiyu Shi1
1University of Notre Dame, US; 2Zhejiang University, CN; 3University at Buffalo, US
Abstract
Large Language Models (LLMs) deployed on edge devices, known as edge LLMs, only use constrained resources to learn from user-generated data. Although existing learning methods have demonstrated performance improvements for edge LLMs, their constraints in high resource cost and low learning capacity limit their effectiveness as optimal learning methods for edge LLMs. Prompt tuning (PT), a learning method without these constraints, has significant potential to improve edge LLM performance while modifying only a small portion of LLM parameters. However, PT-based edge LLMs can suffer from user domain shift, leading to repetitive training that neither effectively improves performance nor resource efficiency. Conventional efforts to address domain shifts involve more complex neural network designs and sophisticated training, inevitably resulting in higher resource usage. It remains an open question: how can we avoid domain shift and high resource usage for edge LLM PT? In this paper, we propose a prompt tuning framework for edge LLMs, exploiting the benefits offered by non-volatile computing-in-memory (NVCiM) architectures. We introduce a novel NVCiM-assisted PT framework, where we narrow down the core operations to matrix-matrix multiplication, accelerated by performing in-situ computation on NVCiM. To the best of our knowledge, this is the first work employing NVCiM to improve the edge LLM PT performance.
11:10 CEST TS01.3 PICELF: AN AUTOMATIC ELECTRONIC LAYER LAYOUT GENERATION FRAMEWORK FOR PHOTONIC INTEGRATED CIRCUITS
Speaker:
Xiaohan Jiang, The Hong Kong University of Science and Technology, HK
Authors:
Xiaohan Jiang1, Yinyi Liu1, Peiyu Chen2, Wei Zhang1 and Jiang Xu2
1The Hong Kong University of Science and Technology, HK; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
In recent years, the advent of photonic integrated circuits (PICs) has demonstrated great prospects and applications to address critical issues such as limited bandwidth, high latency, and high power consumption in data-intensive systems. However, the field of physical design automation for PICs remains in its infancy, with a notable gap in electronic layer layout design tools. Current research on PIC physical design automation primarily focuses on optical layer layouts, often overlooking the equally crucial electronic layer layouts. Although well-established for conventional integrated circuits (ICs), existing EDA tools are inadequately adapted for PICs due to their unique characteristics and constraints. As PICs grow in integration density and size, traditional manual-based design methods become increasingly inefficient and sub-optimal, potentially compromising overall PIC performance. To address this challenge, we propose PICELF, the first framework in the literature for automatic PIC electronic layer layout generation. Our framework comprises a nonlinear binary programming (NBP)-based netlist generator with scalability optimization and a two-stage router featuring initial parallel routing followed by post-routing optimization. We validate our framework's effectiveness and efficiency using a real PIC chip benchmark established by us. Experimental results demonstrate that our method can efficiently generate high-quality PIC electronic layer layouts and satisfy all design rules, within reasonable CPU times, while related existing methods are not applicable.
11:15 CEST TS01.4 SYSTEM LEVEL PERFORMANCE EVALUATION FOR SUPERCONDUCTING SYSTEMS
Speaker:
Debjyoti Bhattacharjee, IMEC, BE
Authors:
Joyjit Kundu, Debjyoti Bhattacharjee, Nathan Josephsen, Ankit Pokhrel, Udara De Silva, Wenzhe Guo, Steven Winckel, Steven Brebels, Manu Perumkunnil, Quentin Herr and Anna Herr, imec, BE
Abstract
Superconducting Digital (SCD) technology offers significant potential for enhancing the performance of next generation large scale compute workloads. By leveraging advanced lithography and a 300 mm platform, SCD devices can reduce energy consumption and boost computational power. This paper presents an analytical performance modeling approach to evaluate the system-level performance benefits of SCD architectures for LLM training and inference. Our findings, based on experimental data and Pulse Conserving Logic (PCL) design principles, demonstrate substantial improvements in both training and inference. SCD's ability to address memory and interconnect limitations positions it as a promising solution for next-generation compute systems.
11:20 CEST TS01.5 INTEGRATED HARDWARE ANNEALING BASED ON LANGEVIN DYNAMICS FOR ISING MACHINES
Speaker:
Hui Wu, University of Rochester, US
Authors:
Yongchao Liu, Lianlong Sun, Michael Huang and Hui Wu, University of Rochester, US
Abstract
Ising machines are non-von Neumann machines designed to solve combinatorial optimization problems (COP) by searching for the ground state, or the lowest energy configuration, within the Ising model. However, Ising machines often face the challenges of getting trapped in local minima due to the complex energy landscapes. Hardware annealing algorithms help mitigate this issue by using a probabilistic approach to steer the system toward the ground state. In this paper, we present a hardware annealing algorithm for Ising machines based on Langevin dynamics, a stochastic perturbation by random noise. Theoretical analysis, system-level design, and detailed circuit design are carried out. We evaluate the performance of the algorithm through chip-level simulation using a standard 65-nm CMOS technology to demonstrate the algorithm's efficacy. The results show that the proposed hardware annealing algorithm effectively guides the system to reach the ground state with a probability of 86.5%, significantly improving the solution quality by 97.5%. Further, we compare the algorithm with state-of-the-art hardware annealing methods through behavioral-level simulations, highlighting its improved solution quality alongside a 50% reduction in time-to-solution.
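As background for the technique named above, overdamped Langevin dynamics in its generic textbook form augments gradient descent on the Ising energy with a white-noise term; the equation below is that generic formulation, not the authors' specific circuit realization:

  ds/dt = -∇E(s) + sqrt(2T) · ξ(t),   with   E(s) = -Σ_{i<j} J_ij s_i s_j - Σ_i h_i s_i

where T is an effective temperature controlling the noise strength and ξ(t) is zero-mean white noise; the noise lets the state escape local minima of E while, on average, descending toward the ground state.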
11:25 CEST TS01.6 NORA: NOISE-OPTIMIZED RESCALING OF LLMS ON ANALOG COMPUTE-IN-MEMORY ACCELERATORS
Speaker:
Garrett Gagnon, Rensselaer Polytechnic Institute, US
Authors:
Yayue Hou1, Hsinyu Tsai2, Kaoutar El Maghraoui2, Tayfun Gokmen2, Geoffrey Burr2 and Liu Liu1
1Rensselaer Polytechnic Institute, US; 2IBM, US
Abstract
Large Language Models (LLMs) have become critical in AI applications, yet current digital AI accelerators suffer from significant energy inefficiencies due to frequent data movement. Analog compute-in-memory (CIM) accelerators offer a potential solution for improving energy efficiency but introduce non-idealities that can degrade LLM accuracy. While analog CIM has been extensively studied for traditional deep neural networks, its impact on LLMs remains unexplored, particularly concerning the large influence of Analog CIM non-idealities. In this paper, we conduct a sensitivity analysis on the effects of analog-induced noise on LLM accuracy. We find that while LLMs demonstrate robustness to weight-related noise, they are highly sensitive to quantization noise and additive Gaussian noise. Based on these insights, we propose a noise-optimized rescaling method to mitigate LLM accuracy loss by shifting the non-ideality burden from the sensitive input/output to the more resilient weight. Through rescaling, we can implement the OPT-6.7b model on simulated analog CIM hardware with less than 1% accuracy loss from the floating-point baseline, compared to a much higher loss of around 30% without rescaling.
11:30 CEST TS01.7 BOSON-1: UNDERSTANDING AND ENABLING PHYSICALLY-ROBUST PHOTONIC INVERSE DESIGN WITH ADAPTIVE VARIATION-AWARE SUBSPACE OPTIMIZATION
Speaker:
Haoyu Yang, Nvidia Inc., US
Authors:
Pingchuan Ma1, Zhengqi Gao2, Amir Begovic3, Meng Zhang3, Haoyu Yang4, Haoxing Ren4, Rena Huang3, Duane Boning2 and Jiaqi Gu1
1Arizona State University, US; 2Massachusetts Institute of Technology, US; 3Rensselaer Polytechnic Institute, US; 4NVIDIA Corp., US
Abstract
Nanophotonic device design aims to optimize photonic structures to meet specific requirements across various applications. Inverse design has unlocked non-intuitive, high-dimensional design spaces, enabling the discovery of compact, high-performance device topologies beyond traditional heuristic or analytic methods. The adjoint method, which calculates analytical gradients for all design variables using just two electromagnetic simulations, enables efficient navigation of this complex space. However, many inverse-designed structures, while numerically plausible, are difficult to fabricate and highly sensitive to physical variations, limiting their practical use. The discrete material distributions with numerous local-optimal structures also pose significant optimization challenges, often causing gradient-based methods to converge on suboptimal designs. In this work, we formulate inverse design as a fabrication-restricted, discrete, probabilistic optimization problem and introduce BOSON-1, an end-to-end, adaptive, variation-aware subspace optimization framework to address the challenges of manufacturability, robustness, and optimizability. With elegant reparametrization, we explicitly emulate the fabrication process and differentiably optimize the design in the fabricable subspace. To overcome optimization difficulty, we propose dense target-enhanced gradient flows to mitigate misleading local optima and introduce a conditional subspace optimization strategy to create high-dimensional tunnels to escape local optima. Furthermore, we significantly reduce the prohibitive runtime associated with optimizing across exponential variation samples through an adaptive sampling-based robust optimization method, ensuring both efficiency and variation robustness. On three representative photonic device benchmarks, our proposed inverse design methodology BOSON-1 delivers fabricable structures and achieves the best convergence and performance under realistic variations, outperforming prior arts with 74.3% post-fabrication performance.
11:35 CEST TS01.8 BIMAX: A BITWISE IN-MEMORY ACCELERATOR USING 6T-SRAM STRUCTURE
Speaker:
Nezam Rohbani, BSC, ES
Authors:
Nezam Rohbani1, Mohammad Arman Soleimani2, Behzad Salami3, Osman Unsal3, Adrian Cristal Kestelman3 and Hamid Sarbazi-Azad4
1Institute for Research in Fundamental Sciences (IPM), IR; 2Sharif University of Technology, IR; 3BSC, ES; 4Sharif University of Technology, IR
Abstract
The in-memory computing (IMC) paradigm reduces costly and inefficient data transfer between memory modules and processing cores by implementing simple and parallel operations inside the memory subsystem. SRAM, the fastest memory structure in the memory hierarchy, is an appropriate platform to implement IMC. However, the main challenges of implementing IMC in SRAM are the limited operations and unreliable accuracy due to environmental noise and process variations. This work proposes a low-latency, energy-efficient, and noise-robust IMC technique, called Bitwise In-Memory Accelerator using 6T-SRAM Structure (BIMAX). BIMAX performs parallel bitwise operations (i.e., (N)AND, (N)OR, NOT, X(N)OR) as well as row-copy with the capability of writing the computation result back to a target memory row. BIMAX functionality is based on an imbalanced differential sense amplifier (SA) that reads and writes data from and into multiple 6T-SRAM cells. The simulations show that BIMAX performs these operations with 52.7% lower energy dissipation and a 5.7% higher average performance rate compared to the state-of-the-art IMC technique. Furthermore, BIMAX is about 5.4× more robust against environmental noise compared to the state-of-the-art.
11:40 CEST TS01.9 DSC-ROM: A FULLY DIGITAL SPARSITY-COMPRESSED COMPUTE-IN-ROM ARCHITECTURE FOR ON-CHIP DEPLOYMENT OF LARGE-SCALE DNNS
Speaker:
Tianyi Yu, Tsinghua University, CN
Authors:
Tianyi Yu, Zhonghao Chen, Yiming Chen, Shuang Wang, Yongpan Liu, Huazhong Yang and Xueqing Li, Tsinghua University, CN
Abstract
Compute-in-Memory (CiM) is a promising technique to mitigate the memory bottleneck for energy-efficient deep neural network (DNN) inference. Unfortunately, conventional SRAM-based CiM has low density and limited on-chip capacity, resulting in undesired weight reloading from off-chip DRAM. The emerging high-density ROM-based CiM architecture has recently revealed the opportunity of deploying large-scale DNNs on-chip, with optional assisting SRAM to ensure moderate flexibility. However, prior analog-domain ROM CiM still suffers from limited memory density improvement and low computing area efficiency due to stringent array structure and large A/D converter (ADC) overhead. This paper presents DSC-ROM, a fully digital sparsity-compressed compute-in-ROM architecture to address these challenges. DSC-ROM introduces a fully synthesizable macro-level design methodology that achieves a record-high memory density of 27.9 Mb/mm^2 in a 28nm CMOS technology. Experimental results show that the macro area efficiency of DSC-ROM improves by 5.6-6.6x compared with prior analog-based ROM CiM. Furthermore, a novel weight fine-tuning technique is proposed to ensure task transfer flexibility and reduce required assisting SRAM cells by 94.4%. Experimental results show that DSC-ROM designed for ResNet-18 pre-trained on ImageNet dataset achieves <0.5% accuracy loss in CIFAR-10 and FER2013, compared with the fully SRAM-based CiM.
11:45 CEST TS01.10 COMPACT NON-VOLATILE LOOKUP TABLE ARCHITECTURE BASED ON FERROELECTRIC FET ARRAY THROUGH IN-SITU COMBINATORIAL ONE-HOT ENCODING FOR RECONFIGURABLE COMPUTING
Speaker:
Weikai Xu, Peking University, CN
Authors:
Weikai Xu, Meng Li, Qianqian Huang and Ru Huang, Peking University, CN
Abstract
Lookup tables (LUTs) are widely used for reconfigurable computing applications due to the capability of implementing arbitrary logic functions. Various emerging non-volatile memories (eNVMs) have been introduced for LUT designs with reduced hardware cost and power consumption compared with conventional SRAM-based LUT. However, the existing designs still follow the conventional LUT architecture, where the memory cells are only used for storage of configuration bits, requiring dedicated bulky multiplexer (MUX) for computation of each LUT, resulting in inevitable high area, latency, and energy cost. In this work, a compact and efficient non-volatile LUT architecture based on ferroelectric FET (FeFET) array is proposed, where the configuration bit storage and computation can be implemented within the FeFET array through in-situ combinatorial one-hot encoding, eliminating the need of costly MUX for each LUT. Moreover, multibit LUTs can be efficiently implemented in the FeFET array using only one shared decoder instead of multiple costly MUXs. Due to the eliminated MUX in the calculation path, the proposed LUT can also achieve enhanced computation speed compared with the conventional LUTs. Based on the proposed LUT architecture, the input expansion of LUT, full adder, and content addressable memory are further implemented and demonstrated with reduced hardware and energy cost. Evaluation results show that the proposed FeFET array-based LUT architecture achieves 51.7×/8.3× reduction in area-energy-delay product compared with conventional SRAM-based/FeFET-based LUT architecture, indicating its great potential for reconfigurable computing applications.
11:50 CEST TS01.11 GRAMC: GENERAL-PURPOSE AND RECONFIGURABLE ANALOG MATRIX COMPUTING ARCHITECTURE
Speaker:
Lunshuai Pan, Peking University, CN
Authors:
Lunshuai Pan, Shiqing Wang, Pushen Zuo and Zhong Sun, Peking University, CN
Abstract
In-memory analog matrix computing (AMC) with resistive random-access memory (RRAM) represents a highly promising solution that solves matrix problems in one step. However, the existing AMC circuits each have a specific connection topology to implement a single computing function, and thus lack the universality of a matrix processor. In this work, we design a reconfigurable AMC macro for general-purpose matrix computations, which is achieved by configuring proper connections between the memory array and amplifier circuits. Based on this macro, we develop a hybrid system that incorporates an on-chip write-verify scheme and digital functional modules, to deliver a general-purpose AMC solver for various applications.
11:51 CEST TS01.12 SHWCIM: A SCALABLE HETEROGENEOUS WORKLOAD COMPUTING-IN-MEMORY ARCHITECTURE
Speaker:
Yanfeng Yang, School of Microelectronics, South China University of Technology, CN
Authors:
Yanfeng Yang1, Yi Zou2, Zhibiao Xue2 and Liuyang Zhang3
1School of Integrated Circuits, South China University of Technology, CN; 2School of Microelectronics, South China University of Technology, CN; 3School of Microelectronics, Southern University of Science and Technology, CN
Abstract
This study introduces HWCIM, an SRAM-based Computing-In-Memory core, and SHWCIM, a CIM-capable Coarse-Grained Reconfigurable Architecture, to enhance resource utilization, multi-functionality, and on-chip memory size in SRAM-based CIM designs. Evaluated using the SMIC 55nm process, HWCIM achieves 1.6× lower power, 2.8× higher energy efficiency, and up to 4.1× smaller area compared to previous CIM and CGRA works. Additionally, SHWCIM delivers an average 105.9× speedup over existing CGRAs and consumes 2–5× less energy than the Nvidia A40 GPU on realistic workloads.

TS02 Secure systems, circuits, and architectures

Date: Monday, 31 March 2025
Time: 11:00 CEST - 12:30 CEST

Time | Label | Presentation Title | Authors
11:00 CEST TS02.1 FLEXENM: A FLEXIBLE ENCRYPTING-NEAR-MEMORY WITH REFRESH-LESS EDRAM-BASED MULTI-MODE AES
Speaker:
Hyunseob Shin, Korea University, KR
Authors:
Hyunseob Shin and Jaeha Kung, Korea University, KR
Abstract
On-chip cryptography engines face significant challenges in efficiently processing large volumes of data while maintaining security and versatility. Most existing solutions support only a single AES mode, limiting their applicability across diverse use cases. This paper introduces FlexENM, a low-power and area-efficient near-eDRAM encryption engine. The FlexENM implements refresh-less operation by leveraging inherent characteristics of the AES algorithm, reordering AES stages, and employing a simultaneous read and write scheme using dual-port eDRAM. Furthermore, FlexENM supports three AES modes, parallelizing their operations and sharing hardware resources across different modes to improve compute efficiency. Compared to other AES engines, FlexENM achieves 16% lower power consumption and 83% higher throughput per unit area, on average, demonstrating improved power- and area-efficiency for on-chip data protection.
11:05 CEST TS02.2 PASTA ON EDGE: CRYPTOPROCESSOR FOR HYBRID HOMOMORPHIC ENCRYPTION
Speaker:
Aikata Aikata, TU Graz, AT
Authors:
Aikata Aikata1, Daniel Sobrino2 and Sujoy Sinha Roy1
1TU Graz, AT; 2Universidad Politécnica de Madrid, ES
Abstract
Fully Homomorphic Encryption (FHE) enables privacy-preserving computation but imposes significant computational and communication overhead on the client for the public-key encryption. To alleviate this burden, previous works have introduced the Hybrid Homomorphic Encryption (HHE) paradigm, which combines symmetric encryption with homomorphic decryption to enhance performance for the FHE client. While early HHE schemes focused on binary data, modern versions now support integer prime fields, improving their efficiency for practical applications such as secure machine learning. Despite several HHE schemes proposed in the literature, there has been no comprehensive study evaluating their performance or area advantages over FHE for encryption tasks. This paper addresses this gap by presenting the first implementation of an HHE scheme, PASTA, a symmetric encryption scheme over integers designed to facilitate fast client encryption and homomorphic symmetric decryption on the server. We provide its performance results for both FPGA and ASIC platforms, including a RISC-V System-on-Chip (SoC) implementation on a low-end 130nm ASIC technology, which achieves a 43–171x speedup compared to a CPU. Additionally, on high-end 7nm and 28nm ASIC platforms, our design demonstrates a 97x speedup over prior public-key client accelerators for FHE. We have made our design public and benchmarked an application to support future research.
11:10 CEST TS02.3 DESIGN, IMPLEMENTATION AND VALIDATION OF NSCP: A NEW SECURE CHANNEL PROTOCOL FOR HARDENED IOT
Speaker:
Vittorio Zaccaria, Politecnico di Milano, IT
Authors:
Joan Bushi1, Alberto Battistello2, Guido Bertoni2 and Vittorio Zaccaria1
1Politecnico di Milano, IT; 2Security Pattern, IT
Abstract
This paper deals with the design, implementation, and validation of a new secure channel protocol to connect microcontrollers and secure elements. The new secure channel protocol (NSCP) relies on a lightweight cryptographic primitive (Xoodyak) and simplified operating principles to provide secure data exchange. The performance of the new protocol is compared with that of GlobalPlatform's Secure Channel Protocol 03 (SCP03), the current de facto standard for hardening the connection between a microcontroller and a secure element in industrial IoT. The evaluation was performed in two scenarios where the secure element was emulated with an Arm Cortex-M4 and an OpenHW RISC-V MPU synthesized on an Artix FPGA. The results of the evaluation indicate the potential advantage of the new protocol over SCP03: in the best case, the new protocol applies cryptographic protection to messages 3.64x to 4x faster than SCP03 at its maximum security level. The speedup in the channel initiation process is also considerable, with a factor of up to 3.7. These findings demonstrate that it is possible to conceive a new protocol which offers adequate cryptographic protection while being more lightweight than the present standard.
11:15 CEST TS02.4 RHYCHEE-FL: ROBUST AND EFFICIENT HYPERDIMENSIONAL FEDERATED LEARNING WITH HOMOMORPHIC ENCRYPTION
Speaker:
Yujin Nam, University of California, San Diego, US
Authors:
Yujin Nam1, Abhishek Moitra2, Yeshwanth Venkatesha2, Xiaofan Yu1, Gabrielle De Micheli1, Xuan Wang1, Minxuan Zhou3, Augusto Vega4, Priyadarshini Panda2 and Tajana Rosing1
1University of California, San Diego, US; 2Yale University, US; 3Illinois Tech, US; 4IBM Research, US
Abstract
Federated learning (FL) is a widely-used collaborative learning approach where clients train models locally without sharing their data with servers. However, privacy concerns remain since clients still upload locally trained models, which could reveal sensitive information. Fully homomorphic encryption (FHE) addresses this issue by enabling clients to share encrypted models and the server to aggregate them without decryption. While FHE resolves the privacy concerns, the encrypted data introduces larger communication and computational complexity. Moreover, ciphertexts are vulnerable to channel noise, where a single bit error can disrupt model convergence. To overcome these limitations, we introduce Rhychee-FL, the first lightweight and noise-resilient FHE-enabled FL framework based on Hyperdimensional Computing (HDC), a low-overhead training method. Rhychee-FL leverages HDC's small model size and noise resilience to reduce communication overhead and enhance model robustness without sacrificing accuracy or privacy. Additionally, we thoroughly investigate the parameter space of Rhychee-FL and propose an optimized system in terms of computation and communication costs. Finally, we show that our global model can successfully converge without being impacted by channel noise. Rhychee-FL achieves comparable final accuracy to CNN, while reaching 90% accuracy in 6x fewer rounds and with 2.2x greater communication efficiency. Our framework shows at least 4.5x faster client side latency compared to previous FHE-based FL works.
11:20 CEST TS02.5 COMPROMISING THE INTELLIGENCE OF MODERN DNNS: ON THE EFFECTIVENESS OF TARGETED ROW PRESS
Speaker:
Shaahin Angizi, New Jersey Institute of Technology, US
Authors:
Ranyang Zhou1, Jacqueline Liu2, Sabbir Ahmed3, Shaahin Angizi1 and Adnan Siraj Rakin2
1New Jersey Institute of Technology, US; 2Binghamton University, US; 3Binghamton University (SUNY), US
Abstract
Recent advancements in side-channel attacks have revealed the vulnerability of modern Deep Neural Networks (DNNs) to malicious adversarial weight attacks. The well-studied RowHammer attack has effectively compromised DNN performance by inducing precise and deterministic bit-flips in the main memory (e.g., DRAM). Similarly, RowPress has emerged as another effective strategy for flipping targeted bits in DRAM. However, the impact of RowPress on deep learning applications has yet to be explored in the existing literature, leaving a fundamental research question unanswered: How does RowPress compare to RowHammer in leveraging bit-flip attacks to compromise DNN performance? This paper is the first to address this question and evaluate the impact of RowPress on DNN applications. We conduct a comparative analysis utilizing a novel DRAM-profile-aware attack designed to capture the distinct bit-flip patterns caused by RowHammer and RowPress. Eleven widely-used DNN architectures trained on different benchmark datasets deployed on a Samsung DRAM chip conclusively demonstrate that they suffer from a drastically more rapid performance degradation under the RowPress attack compared to RowHammer. The difference in the underlying attack mechanism of RowHammer and RowPress also renders existing RowHammer mitigation mechanisms ineffective under RowPress. As a result, RowPress introduces a new vulnerability paradigm for DNN compute platforms and unveils the urgent need for corresponding protective measures.
11:25 CEST TS02.6 COALA: COALESCION-BASED ACCELERATION OF POLYNOMIAL MULTIPLICATION FOR GPU EXECUTION
Speaker:
Homer Gamil, New York University, US
Authors:
Homer Gamil, Oleg Mazonka and Michail Maniatakos, New York University Abu Dhabi, AE
Abstract
In this study, we introduce Coala, a novel framework designed to enhance the performance of finite field transformations for GPU environments. We have developed a GPU-optimized version of the Discrete Galois Transformation (DGT), a variant of the Number Theoretic Transform (NTT). We introduce a novel data access pattern scheme specifically engineered to enable coalesced accesses, significantly enhancing the efficiency of data transfers between global and shared memory. This enhancement not only boosts execution efficiency but also optimizes the interaction with the GPU's memory architecture. Additionally, Coala presents a comprehensive framework that optimizes the allocation of computational tasks across the GPU's architecture and execution kernels, thereby maximizing the use of GPU resources. Lastly, we provide a flexible method to adjust security levels and polynomial sizes through the incorporation of an in-kernel RNS method, and a flexible parameter generation approach. Comparative analysis against current state-of-the-art techniques reveals significant improvements. We observe performance gains of 2.82x - 17.18x against other DGT works on GPUs for different parameters, achieved concurrently with equal or lesser memory utilization.
11:30 CEST TS02.7 HEILP: AN ILP-BASED SCALE MANAGEMENT METHOD FOR HOMOMORPHIC ENCRYPTION COMPILER
Speaker:
Weidong Yang, Shanghai Jiao Tong University, CN
Authors:
Weidong Yang, Shuya Ji, Jianfei Jiang, Naifeng Jing, Qin Wang, Zhigang Mao and Weiguang Sheng, Shanghai Jiao Tong University, CN
Abstract
RNS-CKKS, a fully homomorphic encryption (FHE) scheme that enables secure computation on encrypted data, has been widely used in statistical analysis and data mining. However, developing RNS-CKKS programs requires substantial knowledge of cryptography, which is unfriendly to non-expert programmers. A critical obstacle is scale management, which affects both the complexity of programming and performance. Different FHE operations impose specific requirements on the scale and level, necessitating programmer intervention to ensure the recoverability of the results. Furthermore, operations at different levels have a significant impact on program performance. Existing methods rely on heuristic insights or iterative methods to manage the scales of ciphertexts. However, these methods lack a holistic understanding of the optimization space, leading to inefficient exploration and suboptimal performance. This work proposes HEILP, the first constrained-optimization-based approach for scale management in FHE. HEILP expresses node scale decisions and the insertion of scale management operations as an integer linear programming model, which can be solved with existing mathematical techniques in one shot. Our method creates a more comprehensive optimization space and enables faster and more efficient exploration. Experimental results demonstrate that HEILP achieves an average performance improvement of 1.72x over the existing heuristic method, and delivers a 1.19x performance improvement with 48.65x faster compilation time compared to the state-of-the-art iteration-based method.
11:35 CEST TS02.8 A UNIFIED VECTOR PROCESSING UNIT FOR FULLY HOMOMORPHIC ENCRYPTION
Speaker:
Jiangbin Dong, Xi'an Jiaotong University, CN
Authors:
Jiangbin Dong1, Xinhua Chen2 and Mingyu Gao3
1Xi'an Jiaotong University, CN; 2Fudan University, CN; 3Tsinghua University, CN
Abstract
Fully homomorphic encryption (FHE) algorithms enable privacy-preserving computing directly on encrypted data without leaking sensitive contents, while their excessive computational overheads could be alleviated by specialized hardware accelerators. The vector architecture has been prominently used for FHE accelerators to match the underlying polynomial data structures. While most FHE operations can be efficiently supported by vector processing units, the number theoretic transform (NTT) and automorphism operators involve complex and irregular data permutations among vector elements, and thus are handled with separate dedicated hardware units in existing FHE accelerators. In this paper, we present an efficient inter-lane network design and the corresponding dataflow control scheme, in order to realize NTT and automorphism operations among the multiple lanes of a vector unit. An arbitrarily large operator is first decomposed to fit in the fixed width of the vector unit, and the required data permutation and transposition are conducted on the specialized inter-lane network. Compared to previous designs, our solution reduces the hardware resources needed, with up to 9.4x area and 6.0x power savings for only the inter-lane network, and up to 1.2x area and 1.1x power savings for the whole vector unit.
11:40 CEST TS02.9 TESTING ROBUSTNESS OF HOMOMORPHICALLY ENCRYPTED SPLIT MODEL LLMS
Speaker:
Lars Folkerts, University of Delaware, US
Authors:
Lars Folkerts and Nektarios Georgios Tsoutsos, University of Delaware, US
Abstract
Large language models (LLMs) have recently transformed many industries, enhancing content generation, customer service agents, data analysis, and even software generation. These applications are often hosted on remote servers to protect the neural-network model IP; however, this raises concerns about the privacy of input queries. Fully Homomorphic Encryption (FHE), an encryption technique that allows computations on private data, has been proposed as a solution to this challenge. Nevertheless, due to the increased size of LLMs and the computational overheads of FHE, today's practical FHE LLMs are implemented using a split model approach. Here, a user sends their FHE encrypted data to the server to run an encrypted attention head layer; then the server returns the result of the layer for the user to run the rest of the model locally. By employing this method, the server maintains part of their model IP, while the user still gets to perform private LLM inference. In this work, we evaluate the neural-network model IP protections of single-layer split model LLMs, and demonstrate a novel attack vector that makes it easy for a user to extract the neural network model IP from the server, bypassing the claimed protections for encrypted computation. In our analysis, we demonstrate the feasibility of this attack, and discuss potential mitigations.
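The core leakage can be illustrated in the clear, ignoring encryption entirely (a hypothetical toy, not the paper's setup): if the server's split layer applies a hidden weight matrix to whatever the client submits and returns the result, the client can recover that matrix column by column with standard basis queries.

```python
import numpy as np

rng = np.random.default_rng(0)
W_secret = rng.standard_normal((4, 3))        # server-side weights, hidden from the client

def server_layer(x: np.ndarray) -> np.ndarray:
    """Server applies its proprietary linear layer to the client's (notionally encrypted) input."""
    return W_secret @ x

d = 3
recovered = np.column_stack([server_layer(np.eye(d)[:, j]) for j in range(d)])
print(np.allclose(recovered, W_secret))       # True: the weight matrix is fully recovered
```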
11:45 CEST TS02.10 TARN: TRUST AWARE ROUTING TO ENHANCE SECURITY IN 3D NETWORK-ON-CHIPS
Speaker:
Naghmeh Karimi, University of Maryland Baltimore County, US
Authors:
Hasin Ishraq Reefat1, Alec Aversa2, Ioannis Savidis2 and Naghmeh Karimi1
1University of Maryland Baltimore County, US; 2Drexel University, US
Abstract
The growing complexity and performance demands of modern computing systems have resulted in a shift from traditional System-on-Chip (SoC) designs to Network-on-Chip (NoC) architectures, and further to three-dimensional Network-on-Chip (3D NoC) solutions. Despite their performance and power efficiency, the increased complexity and inter-layer communication of 3D NoCs can create opportunities for adversaries who aim to prevent reliable communication between embedded nodes by inserting hardware Trojans in such nodes. The hardware Trojans, introduced through untrusted third-party Intellectual Property (IP) blocks, can severely compromise 3D NoCs by tampering with data integrity, misrouting packets, or dropping them, thus triggering denial-of-service attacks. Detecting such behaviors is particularly difficult due to their infrequent activation. It is therefore of utmost importance to take the trustworthiness of the embedded nodes into account when routing packets in the NoC. Accordingly, this paper proposes a trust-aware routing scheme, called TARN, to significantly reduce the rate of packet loss that can occur due to malicious behaviors of one or more nodes (or interconnects). Our distributed trust-aware path selection protocol bypasses malicious IPs and securely routes packets to their destination. Furthermore, we introduce a low-overhead mechanism for delegating trust scores to neighboring routers, thereby enhancing network efficiency. Experimental results demonstrate significant reductions in packet loss while imposing low performance and energy overhead.
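The paper's distributed protocol is not reproduced here, but the sketch below conveys the flavor of trust-aware path selection on a toy NoC graph: the cost of a hop is inflated by the distrust of the downstream router, so a shortest-path search naturally routes around nodes suspected of dropping or corrupting packets.

```python
import heapq

def trust_aware_route(adj, trust, src, dst):
    """Dijkstra over trust-weighted hops: low-trust nodes make their incoming links expensive."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue
        for v in adj[u]:
            nd = d + 1.0 / trust[v]          # a hop into a distrusted node costs more
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]

adj = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
trust = {"A": 1.0, "B": 0.2, "C": 0.9, "D": 1.0}   # B is suspected of dropping packets
print(trust_aware_route(adj, trust, "A", "D"))      # ['A', 'C', 'D']
```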
11:50 CEST TS02.11 C2C: A FRAMEWORK FOR CRITICAL TOKEN CLASSIFICATION IN TRANSFORMER-BASED INFERENCE SYSTEMS
Speaker:
Sihyun Kim, KAIST, KR
Authors:
Myeongjae Jang, Jesung Kim, Haejin Nam, Sihyun Kim and Soontae Kim, KAIST, KR
Abstract
Because embedding vectors in a Transformer-based model represent crucial information about input texts, attacks or errors affecting them can cause severe accuracy degradation. For the first time, we observe critical tokens that determine the overall accuracy even though their embedding vectors take up only a small portion of the embedding table. Therefore, we propose a framework called C2C that classifies the critical tokens to facilitate their protection in a Transformer-based inference system with a small overhead. Using BERT with the GLUE datasets, critical embedding vectors take up only 13.8% of the embedding table. Compromising critical embedding vectors can reduce accuracy by up to 44.8% even if other parameters are not corrupted.
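As a hedged illustration of the classification idea (not the C2C framework itself), the sketch below ranks embedding rows by a simple saliency proxy, token frequency times gradient magnitude, and flags the smallest set of rows covering most of the total saliency as critical, so that only those rows would need protection.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = 1000
freq = rng.zipf(2.0, vocab).astype(float)          # token usage frequencies (Zipf-like)
grad_norm = rng.random(vocab)                      # per-row gradient magnitude (stand-in saliency)

saliency = freq * grad_norm
order = np.argsort(-saliency)
cum = np.cumsum(saliency[order]) / saliency.sum()
critical = order[: np.searchsorted(cum, 0.95) + 1]  # smallest prefix covering 95% of saliency
print(f"{len(critical)} of {vocab} embedding rows flagged as critical "
      f"({100 * len(critical) / vocab:.1f}% of the table)")
```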
11:51 CEST TS02.12 A DRAM-BASED PROCESSING-IN-MEMORY ACCELERATOR FOR PRIVACY-PROTECTING MACHINE LEARNING
Speaker and Author:
Bokyung Kim, Rutgers University, US
Abstract
The unprecedented success of deep neural networks (DNNs) has necessitated large-scale matrix processing. Correspondingly, machine learning (ML) accelerators have evolved for general matrix multiplication (GEMM), and the systolic array has been one of the most successful designs for GEMM. However, we observe that its efficiency is questionable for privacy-protecting DNNs, which are in increasing demand. In particular, differential privacy (DP) has been widely applied to DNN training to protect sensitive information, and its unique computation, the per-sample gradient norm, needs low-dimensional-tensor processing. Because of this mismatch, DP training shows poor efficiency on matrix-tailored systolic accelerators, repeatedly under-utilizing the array and incurring redundant data transfers. Departing from the GEMM-optimized systolic architecture, this work proposes a vector-processing-oriented DRAM processing-in-memory (PIM) accelerator, DPIMA, for DP training. Leveraging the advantages of DRAM and PIM, we offer a novel micro-architecture with a full operation adder tree (FOAT) and a systematic dataflow design. Our experiments with various learning models demonstrate that DPIMA achieves 13.8X and 123.8X improvements on average in performance and energy efficiency, respectively, over the systolic baseline.
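For context, the plain-NumPy sketch below (not the DPIMA hardware) shows the per-sample gradient-norm computation that makes DP training a poor match for GEMM-centric systolic arrays: each sample's gradient must be normed and clipped individually before aggregation, which is a vector-style workload rather than one large matrix multiplication.

```python
import numpy as np

rng = np.random.default_rng(0)
per_sample_grads = rng.standard_normal((32, 4096))   # one flattened gradient per sample
clip_norm, sigma = 1.0, 0.8

norms = np.linalg.norm(per_sample_grads, axis=1)          # per-sample L2 norms
scale = np.minimum(1.0, clip_norm / norms)[:, None]       # per-sample clipping factors
clipped = per_sample_grads * scale
noisy_update = clipped.sum(axis=0) + rng.normal(0.0, sigma * clip_norm, 4096)
print(noisy_update.shape)    # (4096,)
```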

CFU Career Fair - University

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 12:00 CEST - 13:00 CEST


LK01 IEEE CEDA Lunchtime Panel: on the occasion of CEDA's 20th anniversary

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 13:15 CEST - 14:00 CEST


ASD02 ASD focus session: Cybersecurity Challenges of Autonomous Systems

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST

Organiser:
Sebastian Steinhorst, TU Munich, DE

With the recent dramatic increase in performance of artificial intelligence and related computing systems, together with advanced sensing, connectivity, and technological platforms, autonomous systems are poised to enter many application domains such as transportation and manufacturing. However, as autonomy increases, the risks of cybersecurity threats are equally rising, requiring the development of sophisticated methods on all layers of autonomous systems architectures. In this session, five experts from different areas of cybersecurity research in industry and academia will present challenges ranging from the physical layer to the system of systems layer of autonomous systems. Using the example of autonomous vehicles to highlight current developments, this session will discuss the efforts necessary to achieve secure and safe autonomous systems. The session will comprise individual 10-minute presentations of the five speakers, followed by a panel discussion that will involve the audience and further deepen the exchange.

Time Label Presentation Title
Authors
14:00 CEST ASD02.1 PHYSICAL LAYER INTEGRITY CHECKS
Presenter:
Mridula Singh, CISPA Helmholtz Center for Information Security, DE
Author:
Mridula Singh, CISPA Helmholtz Center for Information Security, DE
Abstract
.
14:10 CEST ASD02.2 ETHERNETIFICATION OF CAN
Presenter:
Alexander Zeh, Infineon Technologies, DE
Author:
Alexander Zeh, Infineon Technologies, DE
Abstract
.
14:20 CEST ASD02.3 SELF-SOVEREIGN IDENTITIES FOR SOFTWARE-DEFINED VEHICLES
Presenter:
Christian Prehofer, fortiss GmbH, DE
Author:
Christian Prehofer, fortiss GmbH, DE
Abstract
.
14:30 CEST ASD02.4 CAN WE ACHIEVE ACCEPTABLE SECURITY FOR AUTONOMOUS SYSTEMS?
Presenter:
Mikael Asplund, Linköping University, SE
Author:
Mikael Asplund, Linköping University, SE
Abstract
.
14:40 CEST ASD02.5 MANAGING CYBERSECURITY IN THE AUTONOMOUS VEHICLE MOBILITY-AS-A-SERVICE SYSTEM-OF-SYSTEMS
Presenter:
Tobias Löhr, P3 automotive GmbH, DE
Author:
Tobias Löhr, P3 automotive GmbH, DE
Abstract
.
14:50 CEST ASD02.6 PANEL DISCUSSION
Presenter:
All the Panelists, DATE 2025, FR
Author:
All the Panelists, DATE 2025, FR
Abstract
.

BPA02 BPA Session 2

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST BPA02.1 QGDP: QUANTUM LEGALIZATION AND DETAILED PLACEMENT FOR SUPERCONDUCTING QUANTUM COMPUTERS
Speaker:
Junyao Zhang, Duke University, US
Authors:
Junyao Zhang1, Guanglei Zhou1, Feng Cheng1, Jonathan Ku1, Qi Ding2, Jiaqi Gu3, Hanrui Wang4, Hai (Helen) Li1 and Yiran Chen1
1Duke University, US; 2Massachusetts Institute of Technology, US; 3Arizona State University, US; 4University of California, Los Angeles, US
Abstract
Quantum computers (QCs) are currently limited by qubit numbers. A major challenge in scaling these systems is crosstalk, which arises from unwanted interactions among neighboring components such as qubits and resonators. An innovative placement strategy tailored for superconducting QCs can systematically address crosstalk within limited substrate areas. Legalization is a crucial stage in the placement process, refining post-global-placement configurations to satisfy design constraints and enhance layout quality. However, existing legalizers do not support legalizing quantum placements. We aim to address this gap with qGDP, developed to meticulously legalize quantum components by adhering to quantum spatial constraints and reducing resonator crossings to alleviate various crosstalk effects. Our results indicate that qGDP effectively legalizes and fine-tunes the layout, addressing the quantum-specific spatial constraints inherent in various device topologies. Evaluated on diverse benchmarks, qGDP consistently outperforms state-of-the-art legalization engines, delivering substantial improvements in fidelity and reductions in spatial violations, with average gains of 34.4x and 16.9x, respectively.
14:20 CEST BPA02.2 RVEBS: EVENT-BASED SAMPLING ON RISC-V
Speaker:
Tiago Rocha, INESC-ID, Instituto Superior Técnico, University of Lisbon, PT
Authors:
Tiago Rocha1, Nuno Neves2, Nuno Roma2, Pedro Tomás3 and Leonel Sousa4
1INESC-ID, PT; 2INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, PT; 3INESC-ID, Instituto Superior Técnico, PT; 4INESC-ID | Universidade de Lisboa, PT
Abstract
As the RISC-V ISA continues to gain traction for both embedded and high-performance computing, the demand for advanced monitoring tools has become critical for fine-tuning application performance. Current RISC-V hardware performance monitors already provide basic event counting but lack sophisticated features like event-based sampling, which are available in more established architectures such as x86 and ARM. This paper presents the first RISC-V Event-Based Sampling (RVEBS) system for comprehensive performance monitoring and application profiling. The proposed system builds upon existing RISC-V specifications, incorporating the necessary modifications to enable the desired functionality. It also presents an OpenSBI extension to provide privileged software access to newly implemented control status registers that manage the sampling process. An implementation use case based on an OpenPiton processor featuring a CVA6 core in 28nm CMOS technology is presented. The results indicate that the proposed scheme is lightweight, highly accurate, and does not impact the processor's critical path while maintaining minimal impact on overall application performance.
14:40 CEST BPA02.3 XRAY: DETECTING AND EXPLOITING VULNERABILITIES IN ARM AXI INTERCONNECTS
Speaker:
Melisande Zonta, ETH Zurich, CH
Authors:
Melisande Zonta, Nora Hinderling and Shweta Shinde, ETH Zurich, CH
Abstract
The Arm AMBA Advanced eXtensible Interface (AXI) interconnect is a critical IP in FPGA-based designs. While AXI and interconnect designs are primarily optimized for performance, their security requires closer investigation—any bugs in these components can potentially compromise critical IPs like processing systems and memory. To this end, Xray systematically analyzes AXI interconnects. Specifically, it treats the AXI interconnect as a transaction processing block that is expected to adhere to certain properties (e.g., bus and data isolation, progress). Then, Xray employs a traffic generator that creates transaction workloads with the aim of triggering violations in the AXI interconnects. As the last piece of the puzzle, Xray wrappers automatically flag transaction traces as either compliant, errors, or warnings. Put together, Xray comprises 13 properties, has been tested on 7 interconnects, and identifies 41 violations corresponding to 41 vulnerabilities. When compared to existing approaches such as verification IPs (VIPs) and protocol checkers from commercial tools, Xray identifies 19 known and 22 new violations. We show the security impact of Xray by sampling 5 Xray violations to construct 3 proof-of-concept exploits on realistic scenarios deployed on FPGA to leak intermediate data, drop transactions, and corrupt memory.
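Xray's 13 properties are not listed in the abstract; the toy checker below (with a hypothetical trace format) conveys the approach of treating the interconnect as a transaction processor and flagging traces that violate an expected invariant, here a simple data-isolation property.

```python
def check_isolation(trace):
    """Flag read responses that return data not written at the addressed location
    or that are delivered to a master other than the requester (toy invariant)."""
    mem, pending, violations = {}, {}, []
    for ev in trace:
        if ev["type"] == "write":
            mem[ev["addr"]] = ev["data"]
        elif ev["type"] == "read_req":
            pending[ev["id"]] = (ev["master"], ev["addr"])
        elif ev["type"] == "read_resp":
            master, addr = pending.pop(ev["id"])
            if ev["master"] != master or ev["data"] != mem.get(addr):
                violations.append(ev)
    return violations

trace = [
    {"type": "write", "master": "M0", "addr": 0x10, "data": 0xAA},
    {"type": "write", "master": "M1", "addr": 0x20, "data": 0xBB},
    {"type": "read_req", "master": "M0", "addr": 0x10, "id": 7},
    # A buggy interconnect returns M1's data to M0:
    {"type": "read_resp", "master": "M0", "data": 0xBB, "id": 7},
]
print(check_isolation(trace))   # the last event is reported as a violation
```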

FS08 Focus session - The European Chips Act: Ready to Take-Off

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST

Session chair:
Anton Klotz, Fraunhofer, DE

Session co-chair:
Pascal Vivet, CEA, FR

The EU Chips Act is the biggest EU initiative to support the European microelectronics industry. After it entered into force on 21 September 2023, the first calls were issued in 2024. It is time to take a look at the progress that has been made in the past two years and at what lies ahead in 2025 and the following years. Our panelists represent various activities of the EU Chips Act: the head of the Chips JU and representatives of the pilot lines and the Virtual Design Platform initiative. After impulse presentations, there will be a panel discussion, where the panelists will answer questions from the audience on the EU Chips Act.

Participants:
Jari Kinaret, CHIPS JU, BE
Olivier Thomas, CEA, FR
Inge Asselberghs, IMEC, BE
Amelie Hagelauer, Fraunhofer, DE
Helio Fernandez Tellez, IMEC, BE


LKS02 Later … with the keynote speakers

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST


TS03 Embedded software architecture, compilers and tool chains

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS03.1 MPFS: A SCALABLE USER-SPACE PERSISTENT MEMORY FILE SYSTEM FOR MULTIPLE PROCESSES
Speaker:
Bo Ding, Huazhong University of Science and Technology, CN
Authors:
Bo Ding, Wei Tong, Yu Hua, Yuchong Hu, Zhangyu Chen, Xueliang Wei, Qiankun Liu, Dong Huang and Dan Feng, Huazhong University of Science and Technology, CN
Abstract
Persistent memory (PM) leveraging memory-mapped I/O (MMIO) delivers superior I/O performance, leading to the development of user-space PM file systems based on MMIO. While effective in single-process scenarios, these systems encounter challenges in multi-process environments, such as performance degradation due to repeated page faults and cross-process synchronizations, as well as a large memory footprint from duplicated paging structures. To address these problems, we propose a Multi-process PM File System (MPFS). MPFS builds a shareable page table and shares it among processes, avoiding building duplicate paging structures for distinct processes, thereby significantly reducing the software overhead and memory footprint caused by repeated page faults. MPFS further proposes a PGD-aligned (512GB) mapping method to accelerate page table sharing. Furthermore, MPFS provides a cross-process memory protection mechanism based on the PGD-aligned mapping, ensuring multi-process data reliability with negligible overheads. The experimental results show that MPFS outperforms existing user-space PM file systems by 1560% in multi-process scenarios.
14:05 CEST TS03.2 EILID: EXECUTION INTEGRITY FOR LOW-END IOT DEVICES
Speaker:
Youngil Kim, University of California, Irvine, US
Authors:
Sashidhar Jakkamsetti1, Youngil Kim2, Andrew Searles2 and Gene Tsudik2
1Bosch Research, US; 2University of California, Irvine, US
Abstract
Prior research yielded many techniques to mitigate software compromise for low-end Internet of Things (IoT) devices. Some of them detect software modifications via remote attestation and similar services, while others preventatively ensure software (static) integrity. However, achieving run-time (dynamic) security, e.g., control-flow integrity (CFI), remains a challenge. Control-flow attestation (CFA) is one approach that minimizes the burden on devices. However, CFA is not a real-time countermeasure against run-time attacks since it requires communication with a verifying entity. This poses significant risks if safety- or time-critical tasks have memory vulnerabilities. To address this issue, we construct EILID – a hybrid architecture that ensures software execution integrity by actively monitoring control-flow violations on low-end devices. EILID is built atop CASU, a prevention-based (i.e., active) hybrid Root-of-Trust (RoT) that guarantees software immutability. EILID achieves fine-grained backward-edge and function-level forward-edge CFI via semi-automatic code instrumentation and a secure shadow stack.
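As a plain-software illustration of the backward-edge protection that EILID enforces through hardware support and code instrumentation (a sketch, not the paper's mechanism), a shadow stack keeps a protected copy of return addresses and checks every return against it.

```python
class ShadowStack:
    """Toy backward-edge CFI monitor: returns must match the protected copy of the call stack."""
    def __init__(self):
        self._stack = []

    def on_call(self, return_addr: int) -> None:
        self._stack.append(return_addr)

    def on_return(self, target_addr: int) -> None:
        expected = self._stack.pop()
        if target_addr != expected:
            raise RuntimeError(f"CFI violation: return to {target_addr:#x}, expected {expected:#x}")

ss = ShadowStack()
ss.on_call(0x4000)
ss.on_return(0x4000)      # legitimate return: accepted
ss.on_call(0x4010)
try:
    ss.on_return(0x8000)  # corrupted return address (e.g., stack smashing)
except RuntimeError as err:
    print(err)
```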
14:10 CEST TS03.3 DANCER: DYNAMIC COMPRESSION AND QUANTIZATION ARCHITECTURE FOR DEEP GRAPH CONVOLUTIONAL NETWORK
Speaker:
Yi Wang, Shenzhen University, CN
Authors:
Yunhao Dong, Zhaoyu Zhong, Yi Wang, Chenlin Ma and Tianyu Wang, Shenzhen University, CN
Abstract
Graph Convolutional Networks (GCNs) have been widely applied in fields such as social network analysis and recommendation systems. Recently, deep GCNs have emerged, enabling the exploration of deeper hidden information. Compared to traditional shallow GCNs, deep GCNs feature significantly more layers, leading to considerable computational and data movement challenges. Processing-In-Memory (PIM) offers a promising solution for efficiently handling GCNs by enabling near-data computation, thus reducing data transfer between processing units and memory. However, previous work mainly focused on shallow GCNs and has shown limited performance with deep GCNs. In this paper, we present Dancer, an innovative PIM-based GCN accelerator. Dancer optimizes data movement during the inference process, significantly improving efficiency and reducing energy consumption. Specifically, we introduce a novel compressed graph storage architecture and a dynamic quantization technique to minimize data transfers at each layer of the GCN. Additionally, through a detailed analysis of weight dynamics changes, we propose a sparsity propagation strategy to further alleviate the computational and data transfer burden between layers. Experimental results demonstrate that, compared to current state-of-the-art methods, Dancer achieves a 3.7× speedup and 7.6× higher energy efficiency, and reduces DRAM accesses by 9.6× on average.
14:15 CEST TS03.4 LOOPLYNX: A SCALABLE DATAFLOW ARCHITECTURE FOR EFFICIENT LLM INFERENCE
Speaker:
Jianing Zheng, Sun Yat-sen University, CN
Authors:
Jianing Zheng and Gang Chen, Sun Yat-sen University, CN
Abstract
In this paper, we propose LoopLynx, a scalable dataflow architecture for efficient LLM inference that optimizes FPGA usage through a hybrid spatial-temporal design. The design of LoopLynx incorporates a hybrid temporal-spatial architecture, where computationally intensive operators are implemented as large dataflow kernels. This achieves high throughput similar to a spatial architecture, while organizing and reusing these kernels in a temporal way enhances FPGA peak performance. Furthermore, to overcome the resource limitations of a single device, we provide a multi-FPGA distributed architecture that overlaps and hides all data transfers so that the distributed accelerators are fully utilized. By doing so, LoopLynx can be effectively scaled to multiple devices to further explore model parallelism for large-scale LLM inference. Evaluation on the GPT-2 model demonstrates that LoopLynx achieves performance comparable to state-of-the-art single-FPGA accelerators. In addition, compared to an Nvidia A100, our accelerator with a dual-FPGA configuration delivers a 2.52x speed-up in inference latency while consuming only 48.1% of the energy.
14:20 CEST TS03.5 REMAPCOM: OPTIMIZING COMPACTION PERFORMANCE OF LSM TREES VIA DATA BLOCK REMAPPING IN SSDS
Speaker:
Yi Fan, Wuhan University of Technology, CN
Authors:
Yi Fan1, Yajuan Du1 and Sam H. Noh2
1Wuhan University of Technology, CN; 2UNIST, KR
Abstract
In LSM-based KV stores, typically deployed on systems with DRAM-SSD storage, compaction degrades write performance and SSD endurance due to significant write amplification. To address this issue, recent proposals have mostly focused on redesigning the structure of LSM trees. In this paper, we observe the prevalence of data blocks that are simply read and written back without being altered during the LSM tree compaction process, which we refer to as Unchanged Data Blocks (UDBs). These UDBs are a source of unnecessary write amplification, leading to performance degradation and a shortened SSD lifetime. To address this duplication issue, we propose a remapping-based compaction method, which we call RemapCom. RemapCom handles UDB identification and retention by designing a lightweight state machine to track the status of the KV items in each data block, as well as a UDB retention strategy that prevents data blocks from being split due to adjacent intersecting blocks. We implement a prototype of RemapCom on LevelDB by providing two primitives for the remapping. Compared to the state-of-the-art, evaluation results demonstrate that RemapCom can reduce the write amplification by up to 53%.
14:25 CEST TS03.6 A PRACTICAL LEARNING-BASED FTL FOR MEMORY-CONSTRAINED MOBILE FLASH STORAGE
Speaker:
Zelin Du, The Chinese University of Hong Kong, HK
Authors:
Zelin Du1, Kecheng Huang1, Tianyu Wang2, Xin Yao3, Renhai Chen4 and Zili Shao1
1The Chinese University of Hong Kong, HK; 2Shenzhen University, CN; 3Huawei Inc, HK; 4Huawei Inc, CN
Abstract
The rapidly growing mobile market is pushing flash storage manufacturers to expand capacity into the terabyte range. However, this presents a significant challenge for mobile storage management: more logical-to-physical page mappings need to be efficiently managed and cached while the available caching space is extremely limited. This motivates us to shift toward a new learning-based paradigm: rather than maintaining mappings for individual pages, the learning-based approach can represent mapping relationships for a set of continuous pages. However, to construct linear models, existing methods either consume the already-limited memory space or reuse flash garbage collection; they demonstrate poor model construction capabilities or significantly degrade flash performance, making them impractical for real-world use. In this paper, we propose LFTL, a practical, learning-based on-demand flash translation layer design for flash management in mobile devices. In contrast to prior work centered around gathering sufficient mappings for linear model construction, our key insight is that linear patterns can be extracted and refined by leveraging the orderly, LPA-aligned write stream typical of mobile devices. By doing this, highly accurate linear models can be constructed despite the cache limitations of mobile devices. We have implemented a fully functional prototype of LFTL based on FEMU. Our evaluation results show that LFTL offers preferable adaptability to memory-constrained storage devices compared to state-of-the-art learning-based approaches.
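A minimal, hypothetical illustration of the learned-mapping idea (not LFTL's actual construction): when logical pages are written sequentially and LPA-aligned, a single linear segment of the form PPA = slope * LPA + intercept can replace many individual mapping entries, so the cache holds a few model parameters instead of one entry per page.

```python
def fit_linear_segments(mappings, max_err=0):
    """Greedily cover (lpa, ppa) pairs, sorted by LPA, with exact unit-slope segments."""
    segments, i = [], 0
    pts = sorted(mappings)
    while i < len(pts):
        lpa0, ppa0 = pts[i]
        j = i + 1
        slope = 1  # sequential writes map consecutive LPAs to consecutive PPAs
        while j < len(pts):
            lpa, ppa = pts[j]
            if abs(ppa - (ppa0 + slope * (lpa - lpa0))) > max_err:
                break
            j += 1
        segments.append((lpa0, pts[j - 1][0], slope, ppa0))  # (lpa_start, lpa_end, slope, ppa_start)
        i = j
    return segments

# 1000 sequentially written pages plus one relocated page need only two segments.
mappings = [(lpa, 5000 + lpa) for lpa in range(1000)] + [(1000, 9000)]
print(fit_linear_segments(mappings))   # [(0, 999, 1, 5000), (1000, 1000, 1, 9000)]
```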
14:30 CEST TS03.7 CONZONE: A ZONED FLASH STORAGE EMULATOR FOR CONSUMER DEVICES
Speaker:
Dingcui Yu, East China Normal University, CN
Authors:
Dingcui Yu, Jialin Liu, Yumiao Zhao, Wentong Li, Ziang Huang, Zonghuan Yan, Mengyang Ma and Liang Shi, East China Normal University, CN
Abstract
Considering the potential benefits to lifespan and performance, zoned flash storage is expected to be incorporated into the next generation of consumer devices. However, due to the limited volatile cache and heterogeneous flash cells of consumer-grade flash storage, adopting a zone abstraction requires additional internal hardware design to maximize its benefits. To understand and efficiently improve the hardware design of consumer-grade zoned flash storage, we present ConZone—the first emulator tailored to the characteristics of consumer-grade zoned flash storage. Users can explore the internal architecture and management strategies of consumer-grade zoned flash storage and integrate the optimization with software. We validate the accuracy of ConZone by realizing a hardware architecture for consumer-grade zoned flash storage and comparing it with the state-of-the-art. We also present a case study on read performance with ConZone to explore the design of mapping mechanisms and cache management strategies.
14:35 CEST TS03.8 A HARDWARE-ASSISTED APPROACH FOR NON-INVASIVE AND FINE-GRAINED MEMORY POWER MANAGEMENT IN MCUS
Speaker:
Michael Kuhn, University of Tübingen, DE
Authors:
Michael Kuhn, Patrick Schmid and Oliver Bringmann, University of Tübingen, DE
Abstract
The energy demand of embedded systems is crucial and typically dominated by the memory subsystem. Off-the-shelf MCU platforms usually offer a wide range of memory configurations in terms of overall memory size, which may differ in the number of memory banks provided. Split memory banks have the potential to optimize energy demand, but this potential often remains unused in available hardware due to a lack of power management support or the significant manual effort required to leverage the benefits of split-banked memory architectures. This paper proposes an approach to solve the challenge of integrating fine-grained power management support automatically, via a combined hardware/software solution for future off-the-shelf platforms. We present a method to efficiently search for an optimized code and data mapping onto the modules of split memory banks to maximize the idle times of all memory modules. To non-invasively put memory modules into sleep mode, a PC-driven power management controller (PMC) autonomously triggers transitions between power modes during embedded software execution. The evaluation of our optimization flow demonstrates that memory mappings can be explored in seconds, including the generation of the necessary PMC configuration and linker scripts. The application of PC-driven power management enables active memory modules to remain in light sleep mode for approximately 13% to 86% of the execution time, depending on the workload and memory configuration. This results in overall power savings of up to 24% in the memory banks, in terms of static and dynamic power.
14:40 CEST TS03.9 TKD: AN EFFICIENT DEEP LEARNING COMPILER WITH CROSS-DEVICE KNOWLEDGE DISTILLATION
Speaker:
Chaoyao Shen, Southeast University, CN
Authors:
Yiming Ma, Chaoyao Shen, Linfeng Jiang, Tao Xu and Meng Zhang, Southeast University, CN
Abstract
Generating high-performance tensor programs on resource-constrained devices is challenging for current Deep Learning (DL) compilers that use learning-based cost models to predict the performance of tensor programs. Due to the inability of cost models to leverage cross-device information, it is extremely time-consuming to collect data and train a new cost model. To address this problem, this paper proposes TKD, a novel DL compiler that can be efficiently adapted to devices that are resource-constrained. TKD reduces the time budget by over 11x through an adaptive tensor program filter that eliminates redundant and unimportant measurements of tensor programs. Furthermore, by refining the cost model architecture with a multi-head attention module and distilling transferable knowledge from source devices, TKD outperforms state-of-the-art methods in prediction accuracy, compilation time, and compilation quality. We conducted experiments on the edge GPU, NVIDIA Jetson TX2, and the results show that compared to TenSet and TLP, TKD reduces compilation time by 1.58x and 1.16x, while achieving 1.40x and 1.27x speedups of the tensor programs, respectively.
14:45 CEST TS03.10 DISPEED: DISTRIBUTING PACKET FLOW ANALYSES IN A SWARM OF HETEROGENEOUS EMBEDDED PLATFORMS
Speaker:
Louis Morge-Rollet, ENSTA Institut Polytechnique de Paris, FR
Authors:
Louis Morge-Rollet1, Camelia Slimani2, Laurent Lemarchand3, Frédéric Leroy4, Jalil Boukhobza5 and David Espes3
1ENSTA Bretagne, FR; 2ENSTA Bretagne, FR; 3University of Brest, FR; 4ENSTA Bretagne, FR; 5ENSTA Bretagne Lab-STICC, FR
Abstract
Security is a major challenge in swarms of drones. Network intrusion detection systems (IDS) are deployed to analyze and detect suspicious packet flows. Traditionally, they are implemented independently on each drone. However, due to the heterogeneity and resource limitations of drones, IDS algorithms can fall short of satisfying Quality of Service (QoS) metrics, such as latency and accuracy. We argue that a drone can profit from the swarm by delegating part of the analysis of its packet flows to neighbor drones that have more processing power to enforce security. In this paper, we propose two solving methods to distribute the packet flows to be analyzed among drones in a way that ensures they are processed with minimal communication overhead to limit the attack surface, while meeting the QoS metrics imposed by the drone mission. First, we formulate the distribution problem using both an Integer Linear Programming (ILP) model and a Maximum-Flow Minimum-Cost (MFMC) model. Furthermore, we propose two specific solving methods for the distribution problem: (1) a Greedy Heuristic (GH), a non-exact solving method with a small time overhead, and (2) an Adapted Edmonds-Karp (AEK) algorithm, an exact method with a higher time overhead. GH proved to be a very fast solution (up to more than 2000x faster than ILP with Branch and Bound), while the AEK solution finds the exact solution even when the problem is very difficult.
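The ILP and MFMC formulations are in the paper; the sketch below only conveys the flavor of the Greedy Heuristic (GH): each packet flow is assigned to the drone that can still accept it at the lowest communication cost, which is fast but not guaranteed optimal.

```python
def greedy_assign(flows, drones):
    """flows: list of (flow_id, load); drones: dict drone -> {'capacity', 'comm_cost'}.
    Assumes total capacity is sufficient for all flows."""
    remaining = {d: spec["capacity"] for d, spec in drones.items()}
    assignment = {}
    for flow_id, load in sorted(flows, key=lambda f: -f[1]):      # biggest flows first
        candidates = [d for d in drones if remaining[d] >= load]
        best = min(candidates, key=lambda d: drones[d]["comm_cost"])
        assignment[flow_id] = best
        remaining[best] -= load
    return assignment

drones = {"local": {"capacity": 3, "comm_cost": 0.0},
          "neighbor1": {"capacity": 8, "comm_cost": 1.0},
          "neighbor2": {"capacity": 5, "comm_cost": 2.0}}
flows = [("f1", 4), ("f2", 2), ("f3", 3), ("f4", 5)]
print(greedy_assign(flows, drones))
```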
14:50 CEST TS03.11 ONE GRAY CODE FITS ALL: OPTIMIZING ACCESS TIME WITH BI-DIRECTIONAL PROGRAMMING FOR QLC SSDS
Speaker:
Tianyu Wang, Shenzhen University, CN
Authors:
Shaoqi Li1, Tianyu Wang1, Yongbiao Zhu1, Chenlin Ma1, Yi Wang1, Zhaoyan Shen2 and Zili Shao3
1Shenzhen University, CN; 2Shandong University, CN; 3The Chinese University of Hong Kong, HK
Abstract
Gray code, a voltage-level-to-data-bit translation scheme, is widely used in QLC SSDs. However, it causes the four data bits in QLC to exhibit significantly different read and write performance with up to 8x latency variation, severely impacting the worst-case performance of QLC SSDs. This paper presents BDP, a novel Bi-Directional Programming scheme. Based on a fixed Gray code, BDP combines both the normal (forward) and reverse programming directions to enable runtime programming direction arbitration. Experimental results show that BDP can effectively improve the read and write performance of SSD compared to representative schemes.
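For readers unfamiliar with the underlying problem, the sketch below uses a binary-reflected Gray code (real QLC parts use vendor-specific codes) to show why the four bit-pages of a QLC cell need very different numbers of sensing levels, which is the source of the up-to-8x read-latency variation that BDP's runtime programming-direction arbitration works around.

```python
def gray(i: int) -> int:
    return i ^ (i >> 1)

levels = [gray(i) for i in range(16)]    # voltage level -> 4-bit pattern (one bit per page)
for bit in range(4):
    # A page needs one sensing level per adjacent level pair where its bit changes.
    reads = sum(((levels[v] ^ levels[v + 1]) >> bit) & 1 for v in range(15))
    print(f"bit-page {bit}: {reads} sensing level(s) needed")
# Prints 8, 4, 2, 1: an 8x spread between the slowest and fastest page,
# consistent with the latency variation described above.
```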

UF University Fair & Student Teams Fair

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 15:30 CEST


W01 Eco-ES: Eco-design and circular economy of Electronic Systems

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 18:00 CEST


W04 5th Workshop on Open-Source Design Automation (OSDA 2025)

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 14:00 CEST - 18:00 CEST


CFI-CP Career Fair - Industry: Company Presentations

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:15 CEST - 17:30 CEST


ASD03 ASD focus session: Dynamic, Multi-Agent Sensing-to-Action Loops in Distributed Autonomous Edge Computing Systems: Opportunities and Challenges

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:30 CEST - 18:00 CEST

Organisers:
Amit Ranjan Trivedi, University of Illinois at Chicago, US
Saibal Mukhopadhyay, Georgia Tech, US

Autonomous edge computing in robotics, smart cities, and autonomous vehicles depends on seamlessly integrating sensing, processing, and actuation for real-time decision-making in dynamic environments. At its core is the sensing-to-action loop, which continuously aligns sensor inputs with computational models to drive adaptive control. These loops enhance responsiveness by adapting to hyper-local conditions but face challenges like resource constraints, synchronization delays in multi-modal data fusion, and the risk of cascading errors. This focus session examines how proactive, context-aware sensing-to-action and action-to-sensing adaptations can improve efficiency by dynamically adjusting sensing and computation based on task demands, such as selectively sensing a small part of the environment and predicting the rest. Action-to-sensing pathways improve task relevance and resource use by guiding sensing through control actions but require robust monitoring to prevent cascading errors. Multi-agent sensing-action loops extend these benefits through coordinated sensing and actions, optimizing resources via collaboration. Additionally, neuromorphic computing, inspired by biological systems, enables spike-based, event-driven processing that conserves energy, reduces latency, and supports hierarchical control—making it well-suited for multi-agent optimization. Finally, the session highlights the importance of co-designing algorithms, hardware, and environmental dynamics to improve throughput, precision, and adaptability, ultimately advancing energy-efficient edge autonomy in complex environments.

Time Label Presentation Title
Authors
16:30 CEST ASD03.1 SPECULATIVE EDGE-CLOUD DECODING FOR FAST AND RELIABLE DECISION-MAKING IN AUTONOMOUS SYSTEMS
Presenter:
Priyadarshini Panda, Yale University, US
Author:
Priyadarshini Panda, Yale University, US
Abstract
.
16:40 CEST ASD03.2 FILLING IN THE SENSING BLANKS WITH GENERATIVE AI: ULTRA-FRUGAL LIDAR PERCEPTION USING MASKED AUTOENCODERS FOR AUTONOMOUS NAVIGATION
Presenter:
Amit Trivedi, University of Illinois at Chicago, US
Author:
Amit Trivedi, University of Illinois at Chicago, US
Abstract
.
16:50 CEST ASD03.3 ROBOKOOP: EFFICIENT VISUAL CONTROL REPRESENTATIONS FOR ROBOTICS VIA THE KOOPMAN OPERATOR
Presenter:
Saibal Mukhopadhyay, Georgia Tech, US
Author:
Saibal Mukhopadhyay, Georgia Tech, US
Abstract
.
17:00 CEST ASD03.4 NEUROMORPHIC NAVIGATION IN THE REAL WORLD: INTEGRATING REAL-TIME EVENT-BASED VISION WITH PHYSICS-GUIDED PLANNING
Presenter:
Kaushik Roy, Purdue University, US
Author:
Kaushik Roy, Purdue University, US
Abstract
.
17:10 CEST ASD03.5 PANEL DISCUSSION
Presenter:
All the Panelists, DATE 2025, FR
Author:
All the Panelists, DATE 2025, FR
Abstract
.

BPA03 BPA Session 3

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST BPA03.1 TYRCA: A RISC-V TIGHTLY-COUPLED ACCELERATOR FOR CODE-BASED CRYPTOGRAPHY
Speaker:
Alessandra Dolmeta, Politecnico di Torino, IT
Authors:
Alessandra Dolmeta1, Stefano Di Matteo2, Emanuele Valea3, Mikael Carmona4, Antoine Loiseau4, Maurizio Martina5 and Guido Masera5
1Politecnico di Torino, IT; 2CEA-Leti, CEA-List, FR; 3CEA-List, FR; 4CEA-Leti, FR; 5DET - Politecnico di Torino, IT
Abstract
Post-quantum cryptography (PQC) has garnered significant attention across various communities, particularly with the National Institute of Standards and Technology (NIST) advancing to the fourth round of PQC standardization. One of the leading candidates is Hamming Quasi-Cyclic (HQC), which received a significant update on February 23, 2024. This update, which introduces a classical dense-dense multiplication approach, has no dedicated hardware implementations yet. The innovative Core-V eXtension InterFace (CV-X-IF) is a communication interface for RISC-V processors that significantly facilitates the integration of new instructions into the Instruction Set Architecture (ISA) through tightly coupled accelerators. In this paper, we present a TightlY-coupled accelerator for RISC-V for Code-based cryptogrAphy (TYRCA), proposing the first fully tightly-coupled hardware implementation of the HQC-PQC algorithm, leveraging the CV-X-IF. The proposed architecture is implemented on the Xilinx Kintex-7 FPGA. Experimental results demonstrate that TYRCA reduces the execution time by 94% to 96% for HQC-128, HQC-192, and HQC-256, showcasing its potential for efficient HQC code-based cryptography.
16:50 CEST BPA03.2 A SOFT ERROR TOLERANT DUAL STORAGE MODE FLIP-FLOP FOR EFPGA CONFIGURATION HARDENING IN 22NM FINFET PROCESS
Speaker:
Prashanth Mohan, Carnegie Mellon University, US
Authors:
Prashanth Mohan1, Siddharth Das1, Oguz Aatli1, Josh Joffrion2 and Ken Mai1
1Carnegie Mellon University, US; 2Sandia National Laboratories, US
Abstract
We propose a soft error tolerant flip-flop (FF) design to protect configuration storage cells in standard cell-based embedded FPGA fabrics used in SoC designs. Traditional rad-hard FFs such as DICE and Triple Modular Redundant (TMR) use additional redundant storage nodes for soft error tolerance and hence incur high area overheads. Since the eFPGA configuration storage is static, the master latch of the FF is transparent and unused, except when a configuration is loaded. The proposed dual-storage-mode (DSM) FF reuses the master and slave latches as redundant storage along with a C-element for error correction. The DSM FF was fabricated on a 22nm FinFET process along with standard D-FF, pulse DICE FF, and TMR FF designs to evaluate soft error tolerance. The radiation test results show that the DSM FF can reduce the error cross section by more than three orders of magnitude (3735X) compared to the standard D-FF and two orders of magnitude (455X) compared to the pulse DICE FF with a comparable area. Additionally the DSM FF is ~42% smaller than the TMR FF with similar error cross section.
17:10 CEST BPA03.3 REBERT: LLM FOR GATE-LEVEL TO WORD-LEVEL REVERSE ENGINEERING
Speaker:
Azadeh Davoodi, University of Wisconsin Madison, US
Authors:
Lizi Zhang1, Azadeh Davoodi2 and Rasit Topaloglu3
1University of Wisconsin Madison, US; 2University of Wisconsin-Madison, US; 3Adeia, US
Abstract
In this paper, we introduce ReBERT, a specialized large language model (LLM) based on BERT, fine-tuned specifically for grouping bits into words within gate-level netlists. By treating the netlist as a form of language, we encode bits and their fan-in cones into sequences that capture structural dependencies. A novel contribution is augmenting BERT's embedding with a tree-based embedding strategy which mirrors the hierarchical nature of circuit designs in hardware. Leveraging the powerful representational learning capabilities of LLMs, we interpret hardware circuits at a higher level of abstraction. We evaluate ReBERT on various hardware designs, demonstrating that it significantly outperforms a state-of-the-art work based on partial structural matching in recovering word-level groupings. Our improvements range on average from 12.2% to 218.2%, depending on the degree to which the structural patterns are corrupted.

FS03 Focus session - Design Automation for Physical Computing Systems

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:30 CEST - 18:00 CEST

Session chair:
Antonino Tumeo, PNNL, US

Organiser:
Anup Das, Drexel University, US

Time Label Presentation Title
Authors
16:30 CEST FS03.1 ANALOG SYSTEM SYNTHESIS FOR FPAAS AND CUSTOM ANALOG IC DESIGN
Speaker:
Jennifer Hasler, Georgia Tech, US
Authors:
Jennifer Hasler, Afolabi Ige and Linhao Yang, Georgia Tech, US
Abstract
Synthesis tools can unlock the potential of analog architectures to achieve real-time computation, signal processing, inference and learning for low-SWaP systems on commercial timescales. We present a methodology and results towards system-level analog and mixed-signal synthesis for both FPAAs and custom analog IC design. Building on previous efforts on tools targeting large-scale Field Programmable Analog Arrays (FPAAs) enables tools capable of synthesizing new ICs. The IC synthesis is built upon our recent work on an analog & mixed-signal programmable CMOS standard cell library that has been demonstrated across a range of CMOS process nodes (e.g., 180nm, 130nm, 65nm, 28nm, and 16nm CMOS). This synthesis can be extended to synthesizing new configurable fabrics for a new IC and generating the resulting configuration files to target that fabric. The entire tool flow is being developed as an open-source tool that will be widely available. These approaches move analog and mixed-signal design towards structured Design Space Exploration (DSE), and create a significant need for rapid analog simulation.
16:53 CEST FS03.2 GAIN-BASED COMPUTING WITH COUPLED LIGHT AND MATTER
Presenter:
Natalia Berloff, University of Cambridge, GB
Author:
Natalia Berloff, University of Cambridge, GB
Abstract
Gain-based computing based on light-matter interactions is a novel approach to physics-based hardware and physics-inspired algorithms. In gain-based computing, the complex optimisation problems are encoded in the gain and loss rates of driven-dissipative systems. The system is driven through a symmetry-breaking transition on the changing loss landscape until a mode that minimises losses is selected, manifesting the optimal solution to the original problem. This process allows for solving important combinatorial optimisation problems via mapping to Ising, XY, and k-local Hamiltonians, emphasising the system's applicability across various physical platforms, including photonic, electronic, and atomic systems. Two primary directions have emerged for developing gain-based analogue hardware, each using distinct aspects of physics' role in computational processes. The first approach exploits natural evolution principles of physical systems influenced and driven by external parameters, with the challenge of establishing controllable couplings between 'spins'. Polariton condensates in inorganic and organic-inorganic halide perovskites, atoms in QEDs and degenerate laser systems exemplify this. Conversely, the second approach, represented by technologies like analogue interactive machines (AIM) and spatial photonic machines (SPIM), focuses on establishing couplings through processes like light propagation, optical modulation, and signal detection, thereby managing system dynamics through feedback loops. Both platforms' core is the critical process leading to the optimum solution. Despite advancements in the physical realisation of these concepts, critical questions remain about scalability, the influence of phase space structures on system performance, and identifying problems best suited for these unconventional computing architectures. We need to understand the dynamic behaviour of the systems during symmetry-breaking transitions, trajectory optimisation towards global minima, error probabilities, and the potential for dissipation and nonlinearities to rectify these errors, highlighting the pivotal role of theory in addressing these challenges. By comparing various experimental platforms, including polaritons, lasers, and cold atoms, we should emphasise and exploit the universal nature of these questions. My talk outlines a strategic plan to tackle these outstanding questions while discussing and contrasting different approaches
17:15 CEST FS03.3 CHEMCOMP: COMPILING AND COMPUTING WITH CHEMICAL REACTION NETWORKS
Speaker:
Antonino Tumeo, Pacific Northwest National Laboratory, US
Authors:
Nicolas Agostini, Connah Johnson, William Cannon and Antonino Tumeo, Pacific Northwest National Laboratory, US
Abstract
The exponential growth in computing demands driven by scientific computing, data analytics, and artificial intelligence is pushing conventional CMOS-based high-performance computing systems to their physical and energy efficiency limits. As we approach the era of post-exascale computing, disruptive approaches are necessary to overcome these barriers and achieve substantial gains in energy efficiency. Analog and hybrid digital-analog computing systems have emerged as promising alternatives, offering the potential for orders-of-magnitude improvements in efficiency. Among these, biochemical computing stands out as a novel paradigm capable of leveraging the natural efficiency of chemical reactions, which inherently solve optimization problems by converging to steady states. By scaling up reaction networks or reaction vessel sizes, biochemical systems present an opportunity to meet the high-performance demands of modern computing tasks. Despite their promise, significant theoretical and practical challenges remain, particularly in formulating and mapping computational problems to chemical reaction networks (CRNs) and designing viable biochemical computing devices. This paper addresses these challenges by introducing ChemComp, a comprehensive framework for chemical computation. The framework features an abstract chemical reaction dialect implemented as a multi-level intermediate representation (MLIR) compiler extension and provides a systematic approach to translating mathematical problems into CRNs. We demonstrate the potential of our framework through a case study emulating a simplified chemical reservoir computing device. This work establishes the foundational tools and methodologies necessary to harness the computational power of chemistry, paving the way for the development of energy-efficient, high-performance computing systems tailored to contemporary and future computational needs.
17:38 CEST FS03.4 EXPLORING DENDRITIC COMPUTATION IN BIO-INSPIRED ARCHITECTURES FOR DYNAMIC PROGRAMMING
Speaker and Author:
Anup Das, Drexel University, US
Abstract
Dynamic programming is a classical optimization technique that systematically decomposes a complex problem into simpler sub-problems to find an optimal solution. We explore the use of bio-inspired architectures to find the shortest path between two nodes in a graph using dynamic programming. We leverage dendritic computations, which are linear and non-linear mechanisms in neuronal dendrites that allow different computational primitives to be implemented. We exploit two key mechanisms: 1) a dendrite acts as a delay line to propagate an excitatory post-synaptic potential to the soma, and 2) a feedback mechanism from the soma into the dendrites controls this delay. Our key ideas are the following. First, we model each node of a graph as a leaky integrate-and-fire (LIF) neuron supporting the two dendritic mechanisms. We use a countdown counter to implement forward propagation of a delayed synaptic potential and eligibility-trace-based feedback to update the delay by incorporating the cost of edges in the graph. Next, we formulate dynamic programming in terms of the time to the first spike in neurons. We break down the shortest path problem into sub-problems of finding the earliest firing times of neurons and iteratively build the final solution from these smaller sub-problems by tracing backward. We implement this approach for several real-world graphs and show its scalability. We also show an early prototype on a Virtex UltraScale FPGA.
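A software analogue of the idea (not the dendritic or FPGA implementation): if each node fires at its earliest incoming spike time and forwards the spike after a delay equal to the edge cost, the firing time of every node equals its shortest-path distance, which is exactly the dynamic-programming recurrence realized with dendritic delay lines.

```python
import heapq

def time_to_first_spike(graph, src):
    """Event-driven simulation: each node 'fires' at its earliest spike arrival time.
    graph: dict node -> list of (neighbor, edge_cost_as_delay)."""
    fired, events = {}, [(0.0, src)]
    while events:
        t, node = heapq.heappop(events)
        if node in fired:                 # a neuron fires only once; earliest arrival wins
            continue
        fired[node] = t
        for nbr, delay in graph.get(node, []):
            heapq.heappush(events, (t + delay, nbr))
    return fired                          # firing time == shortest-path distance from src

graph = {"A": [("B", 2), ("C", 5)], "B": [("C", 1), ("D", 4)], "C": [("D", 1)], "D": []}
print(time_to_first_spike(graph, "A"))    # {'A': 0.0, 'B': 2.0, 'C': 3.0, 'D': 4.0}
```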

TS04 Emerging design technologies for future memories

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS04.1 GLEAM: GRAPH-BASED LEARNING THROUGH EFFICIENT AGGREGATION IN MEMORY
Speaker:
Ivris Raymond, University of Michigan, US
Authors:
Andrew McCrabb, Ivris Raymond and Valeria Bertacco, University of Michigan, US
Abstract
Graph Neural Networks (GNNs) have emerged as a powerful tool for analyzing relationship-based data, such as those found in social networks, logistics, weather forecasting, and other domains. Inference and training with GNN models execute slowly, bottlenecked by limited data bandwidths between memory and GPU hosts, as a result of the many irregular memory accesses inherent to GNN-based computation. To overcome these limitations, we present GLEAM, a Processing-in-Memory (PIM) hardware accelerator designed specifically for GNN-based training and inference. GLEAM units are placed per-bank and leverage the much larger, internal bandwidth of HBMs to handle GNNs' irregular memory accesses, significantly boosting performance and reducing the energy consumption entailed by the dominant activity of GNN-based computation: neighbor aggregation. Our evaluation of GLEAM demonstrates up to a 10x speedup for GNN inference over GPU baselines, alongside a significant reduction in energy usage.
16:35 CEST TS04.2 PFP: PARALLEL FLOATING-POINT VECTOR MULTIPLICATION ACCELERATION IN MAGIC RERAM
Speaker:
Wenqing Wang, National University of Defense Technology, CN
Authors:
Wenqing Wang, Ziming Chen, Quan Deng and Liang Fang, National University of Defense Technology, CN
Abstract
Emerging applications, e.g., machine learning, large language models (LLMs), and graphic processing, are rapidly developing and are both compute-intensive and memory-intensive. Computing in Memory (CIM) is a promising architecture that accelerates these applications by eliminating the data movement between memory and processing units. Memristor-aided logic (MAGIC) CIM achieves massive parallelism, flexible computing, and non-volatility. However, MAGIC ReRAM performs floating-point (FP) vector multiplication sequentially, which wastes parallel computing resources and is limited by the array size. To solve this issue, we propose a parallel floating-point vector multiplication accelerator in MAGIC ReRAM. We exploit three levels of parallelism during the calculation of FP vector multiplication, referred to as PFP. First, we leverage the parallelism of MAGIC ReRAM. Second, we bring forward the final exponent to make the exponent calculations parallel. Third, we decouple the calculation of exponent, mantissa, and sign, which allows parallel calculation across accumulation. The experimental results show that PFP achieves a performance speedup of 2.51× and 15% energy savings compared to AritPIM when performing FP32 vector multiplication with a vector length of 512.
16:40 CEST TS04.3 AN EDRAM DIGITAL IN-MEMORY NEURAL NETWORK ACCELERATOR FOR HIGH-THROUGHPUT AND EXTENDED DATA RETENTION TIME
Speaker:
Jehun Lee, Seoul National University, KR
Authors:
Inhwan Lee1, Jehun Lee2, Jaeyong Jang2 and Jae-Joon Kim2
1Pohang University of Science and Technology, KR; 2Seoul National University, KR
Abstract
Computing-in-Memory (CIM) optimizes multiply-and-accumulate (MAC) operations for energy-efficient acceleration of neural network models. While SRAM has been a popular choice for CIM designs due to its compatibility with logic processes, its large cell size restricts storage capacity for neural network parameters. Consequently, gain-cell eDRAM, featuring memory cells with only 2-4 transistors, has emerged as an alternative for CIM cells. While digital CIM (DCIM) structure has been actively adopted in SRAM-based CIMs for better accuracy and scalability than analog CIMs (ACIM), previous eDRAM-based CIMs still employed ACIM structure since the eDRAM CIM cells were not able to perform a complete digital logic operation. In this paper, we propose an eDRAM bit cell for more efficient DCIM operations using only 4 transistors. The proposed eDRAM DCIM structure also maintains consistent and accurate output values over time, improving retention times compared to previous eDRAM ACIM designs. We validate our approach by fabricating an eDRAM DCIM macro chip and conducting hardware validation experiments, measuring retention time and neural network accuracy. Experimental results show that the proposed eDRAM DCIM achieves 3× longer retention time than state-of-the-art eDRAM ACIM designs, along with higher throughput without accuracy loss.
16:45 CEST TS04.4 A TWO-LEVEL SLC CACHE HIERARCHY FOR HYBRID SSDS
Speaker:
Jun Li, Nanjing University of Posts and Telecommunications, CN
Authors:
Li Cai1, Zhibing Sha1, Jun Li2, Jiaojiao Wu1, Huanhuan Tian1, Zhigang Cai1 and Jianwei Liao1
1Southwest University, CN; 2Nanjing University of Posts and Telecommunications, CN
Abstract
Although high-density NAND flash memory, such as triple-level-cell (TLC) flash memory, can offer high density, its lower write performance and endurance compared to single-level-cell (SLC) flash memory are impediments to the proliferation of TLC products. To overcome such disadvantages of TLC flash memory, hybrid architectures, which integrate a portion of SLC chips and employ them as a write cache, are widely adopted in commercial solid-state disks (SSDs). However, it is challenging to optimize the SLC cache with respect to, for example, the granularity of cached data and cold/hot data separation. In this paper, we propose supporting a two-level hierarchy (i.e., L1 and L2) of SLC cache stores based on varying granularities of cached data. Moreover, we support the segmentation of the L1 and L2 caches in the SLC region in a dynamic manner, by considering the write size characteristics of user applications. The evaluation results show that our proposal can improve I/O performance by between 12.6% and 25.1%, in contrast to existing cache management schemes for SLC-TLC hybrid storage.
16:50 CEST TS04.5 MULTI-MODE BORDERGUARD CONTROLLERS FOR EFFICIENT ON-CHIP COMMUNICATION IN HETEROGENEOUS DIGITAL/ANALOG NEURAL PROCESSING UNITS
Speaker:
Hong Pang, ETH Zurich, CH
Authors:
Hong Pang1, Carmine Cappetta2, Riccardo Massa2, Athanasios Vasilopoulos3, Elena Ferro3, Gamze Islamoglu1, Angelo Garofalo4, Francesco Conti5, Luca Benini6, Irem Boybat3 and Thomas Boesch7
1ETH Zurich, CH; 2STMicroelectronics, IT; 3IBM Research Europe - Zurich, CH; 4University of Bologna, ETH Zurich, IT; 5Università di Bologna, IT; 6ETH Zurich, CH | Università di Bologna, IT; 7STMicroelectronics, CH
Abstract
Driven by the growing demand for data-intensive parallel computation, particularly for Matrix-Vector Multiplications (MVMs), and the pursuit of high energy efficiency, Analog In-Memory Computing (AIMC) has garnered significant attention. AIMC addresses the data movement bottleneck by performing MVMs directly within memory, significantly reducing latency and enhancing energy efficiency. Integrating AIMC with digital units for non-MVM operations yields heterogeneous Neural Processing Units (NPUs) that can be combined in a tiled architecture to deliver promising solutions for end-to-end AI inference. Besides powerful heterogeneous NPUs, an efficient on-chip communication infrastructure is also pivotal for inter-node data transmission and efficient AI model execution. This paper introduces the Borderguard Controller (BG-CTRL), a multi-mode, path-through routing controller designed to support three distinct operating modes—time-scheduling, data-driven, and time-sliced data-driven (TSDD)—each offering varying levels of routing flexibility and energy efficiency depending on the data flow patterns and AI model complexity. To demonstrate the design, BG-CTRLs are integrated into a 9-node system of heterogeneous NPUs, arranged in a 3x3 grid and connected using a 2D mesh topology. The system is synthesized using STM 28nm FD-SOI technology. Experimental results show that the BG-CTRL cluster achieves an aggregate throughput of 983 Gb/s, with an energy efficiency of up to 0.41 pJ/B/hop at 0.64 GHz, and a minimal area overhead of 204 kGE.
16:55 CEST TS04.6 MAPPING SPIKING NEURAL NETWORKS TO HETEROGENEOUS CROSSBAR ARCHITECTURES USING INTEGER LINEAR PROGRAMMING
Speaker:
Devin Pohl, Georgia Tech, US
Authors:
Devin Pohl1, Aaron Young2, Kazi Asifuzzaman2, Narasinga Miniskar2 and Jeffrey Vetter2
1Georgia Tech, US; 2Oak Ridge National Lab, US
Abstract
Advances in novel hardware devices and architectures allow Spiking Neural Network (SNN) evaluation using ultra-low power, mixed-signal, memristor crossbar arrays. As individual network sizes quickly scale beyond the dimensional capabilities of single crossbars, networks must be mapped onto multiple crossbars. Crossbar sizes within modern Memristor Crossbar Architectures (MCAs) are determined predominantly not by device technology but by network topology; using more, smaller crossbars consumes less area thanks to the high structural sparsity found in larger, brain-inspired SNNs. Motivated by continuing increases in SNN sparsity due to improvements in training methods, we propose utilizing heterogeneous crossbar sizes to further reduce area consumption. This approach was previously unachievable as prior compiler studies only explored solutions targeting homogeneous MCAs. Our work improves on the state-of-the-art by providing Integer Linear Programming (ILP) formulations supporting arbitrarily heterogeneous architectures. By modeling axonal interactions between neurons, our methods produce better mappings while removing inhibitive a priori knowledge requirements. We first show a 16.7–27.6% reduction in area consumption for square-crossbar homogeneous architectures. Then, we demonstrate 66.9–72.7% further reduction when using a reasonable configuration of heterogeneous crossbar dimensions. Next, we present a new optimization formulation capable of minimizing the number of inter-crossbar routes. When applied to solutions already near-optimal in area, an 11.9–26.4% routing reduction is observed without impacting area consumption. Finally, we present a profile-guided optimization capable of minimizing the number of runtime spikes between crossbars. Compared to the best-area-then-route optimized solutions, we observe a further 0.5–14.8% inter-crossbar spike reduction while requiring 1–3 orders of magnitude less solver time.
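To make the ILP formulation idea concrete, the following minimal sketch assigns neurons to heterogeneous crossbars with the open-source pulp solver; the neuron fan-ins, crossbar dimensions and area costs are invented example data, and the model is far simpler than the paper's formulations (axonal interactions and inter-crossbar routing are not captured).

# Toy ILP: map each neuron to exactly one crossbar, respect fan-in (rows)
# and neuron-count (columns) limits, and minimize the area of used crossbars.
import pulp

neurons = {0: 3, 1: 5, 2: 2, 3: 7}           # neuron -> required input rows (fan-in)
crossbars = {"A": (8, 4, 32.0),               # name -> (rows, columns, area cost)
             "B": (16, 8, 120.0),
             "C": (4, 4, 20.0)}

prob = pulp.LpProblem("snn_to_crossbars", pulp.LpMinimize)
x = {(n, c): pulp.LpVariable(f"x_{n}_{c}", cat="Binary")
     for n in neurons for c in crossbars}
used = {c: pulp.LpVariable(f"used_{c}", cat="Binary") for c in crossbars}

# Objective: total area of the crossbars that end up being used.
prob += pulp.lpSum(crossbars[c][2] * used[c] for c in crossbars)

for n in neurons:                             # each neuron mapped exactly once
    prob += pulp.lpSum(x[n, c] for c in crossbars) == 1
for c, (rows, cols, _) in crossbars.items():
    prob += pulp.lpSum(x[n, c] for n in neurons) <= cols * used[c]   # column budget
    for n, fanin in neurons.items():
        prob += fanin * x[n, c] <= rows       # fan-in must fit the crossbar's rows

prob.solve(pulp.PULP_CBC_CMD(msg=False))
mapping = {n: c for (n, c), var in x.items() if var.value() > 0.5}
print(mapping)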
17:00 CEST TS04.7 AN EFFICIENT ON-CHIP REFERENCE SEARCH AND OPTIMIZATION ALGORITHMS FOR VARIATION-TOLERANT STT-MRAM READ
Speaker:
Kiho Chung, Sungkyunkwan University, KR
Authors:
Kiho Chung, Youjin Choi, Donguk Seo and Yoonmyung Lee, Sungkyunkwan University, KR
Abstract
A novel reference search algorithm is proposed in this paper to significantly reduce the reference search time of embedded spin transfer torque magnetic random access memory (STT-MRAM). Unlike conventional methods that sequentially search reference levels with linearly increasing references, the proposed Dual Read Reference Search (DRRS) algorithm requires only two array read operations. By analyzing the statistical characteristics of the read data using a customized function, the optimal reference level can be quickly determined in a few steps. Consequently, the number of read operations required for a reference search is reduced, providing a substantial improvement in the reference search time. The DRRS algorithm can be operated on-chip, and its effectiveness was confirmed through simulations. The optimization speed was improved by 85% compared to conventional methods. Additionally, a Triple Read Reference Search (TRRS) algorithm is proposed to decrease the variation occurring across different cell arrays and to enhance optimization accuracy. STT-MRAM is composed of numerous cell arrays, where the cell distributions in each array exhibit different characteristics. The TRRS algorithm enhances optimization accuracy for variations occurring in each array, achieving over a 2x increase in accuracy compared to the DRRS algorithm. Furthermore, a Simultaneous Reference Search for P and AP (SRS) algorithm is proposed, which significantly reduces the search time by simultaneously optimizing Parallel (P) and Anti-parallel (AP) state reference cells. Lastly, regarding cell degradation after power-up, the proposed time-saving algorithms (DRRS, TRRS and SRS) enable prompt re-optimization in the event of errors caused by cell degradation and ensure regular optimization to maintain maximum read margin even before errors occur, thereby enhancing reliability.
17:05 CEST TS04.8 FDAIMC: A FULLY-DIFFERENTIAL ANALOG IN-MEMORY-COMPUTING FOR MAC IN MRAM WITH ACCURACY CALIBRATION UNDER PROCESS AND VOLTAGE VARIATION
Speaker:
Xiangyu Li, School of Microelectronics Science and Technology, Sun Yat-sen University, CN
Authors:
Xiangyu Li1, Weichong Chen1, Ruida Hong1, Jinghai Wang2, Ningyuan Yin1 and Zhiyi Yu1
1School of Microelectronics Science and Technology, Sun Yat-sen University, CN; 2Sun Yat-sen University, CN
Abstract
Analog in-memory-computing (AIMC) is adopted extensively in non-volatile memory for multibit multiply-and-accumulate (MAC) operation. However, the low on/off ratio of the magnetic tunnel junction (MTJ) impedes a high-performance AIMC macro based on spin transfer torque magnetic random access memory (STT-MRAM). Secondly, because of the uncertainty of a mixed-signal system under process and voltage variation, calibration support is indispensable. Moreover, the incompatibility between a nonlinear analog signal and a linear digital signal hinders accurate computation and calibration support. To overcome these challenges, this work proposes an STT-MRAM-AIMC macro featuring: 1) a 2-level-differential cell array and a linear computing scheme with calibration support in the analog domain; 2) an analog-digital-conversion (ADC) system, including a slew-rate-independent voltage-to-time converter (SRIVTC) scheme and a self-triggered time-to-MAC value converter (STTMC) scheme; 3) a compact layout design for high area efficiency. Finally, an average accuracy of 95.44% is obtained under the TT&0.9V corner. By using the calibration strategy, average accuracies of 97.8% and 88.6% are obtained under the FF&0.945V and SS&0.855V corners, respectively, an enhancement of over 30%. Furthermore, an area FoM 1.64–21.18 times better than the state of the art is obtained, along with an energy efficiency of 87.2–312.4 TOPS/W.
17:10 CEST TS04.9 ARBITER: ALLEVIATING CONCURRENT WRITE AMPLIFICATION IN PERSISTENT MEMORY
Speaker:
Bolun Zhu, Huazhong University of Science and Technology, CN
Authors:
Bolun Zhu and Yu Hua, Huazhong University of Science and Technology, CN
Abstract
Persistent memory (PM) bridges the gap between high performance and persistence, and has thus received significant research attention. The concurrency in PM is often constrained due to limited concurrent I/O bandwidth. The I/O requests from different threads are serialized and interleaved in the memory controller. Such concurrent interleaving unintentionally hurts the locality of PM's on-DIMM buffer (XPBuffer) and thus causes significant performance degradation. Existing systems either endure the performance degradation caused by concurrent interleaving or leverage dedicated background threads to asynchronously perform I/O to PM. Unlike conventional designs, we present a non-blocking synchronous I/O scheduling mechanism that achieves high performance and low I/O amplification. The key insight is that inserting a proper number of delays into I/O can mitigate the I/O amplification and improve the effective bandwidth. We periodically assess the system state and adaptively determine the number of delays to be inserted for each thread. Evaluation results show that our design can significantly alleviate the I/O amplification and improve application performance for concurrent applications.
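The feedback loop hinted at in the abstract (measure amplification, then throttle threads) can be pictured with the toy controller below; the amplification estimate, target, step size and API are invented for illustration and are not the paper's actual mechanism.

# Toy per-thread delay controller: if estimated write amplification exceeds a
# target, add a small delay before that thread's next I/O; otherwise shrink it.
import time

class DelayController:
    def __init__(self, target_amplification=1.2, step_us=2, max_delay_us=50):
        self.delay_us = {}                      # per-thread delay in microseconds
        self.target = target_amplification
        self.step = step_us
        self.max_delay = max_delay_us

    def update(self, thread_id, bytes_issued, bytes_written_to_media):
        amplification = bytes_written_to_media / max(bytes_issued, 1)
        cur = self.delay_us.get(thread_id, 0)
        if amplification > self.target:
            cur = min(cur + self.step, self.max_delay)   # slow the thread down
        else:
            cur = max(cur - self.step, 0)                # speed it back up
        self.delay_us[thread_id] = cur

    def before_io(self, thread_id):
        time.sleep(self.delay_us.get(thread_id, 0) / 1e6)

ctrl = DelayController()
ctrl.update(thread_id=1, bytes_issued=4096, bytes_written_to_media=12288)
ctrl.before_io(thread_id=1)   # pays a small delay because amplification is high
print(ctrl.delay_us)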
17:15 CEST TS04.10 TRACKSCORER: SKYRMION LOGIC-IN-MEMORY ACCELERATOR FOR TREE-BASED RANKING MODELS
Speaker:
Elijah Cishugi, University of Twente, NL
Authors:
Elijah Cishugi1, Sebastian Buschjäger2, Martijn Noorlander1, Marco Ottavi3 and Kuan-Hsun Chen1
1University of Twente, NL; 2The Lamarr Institute for Machine Learning and Artificial Intelligence and TU Dortmund University, DE; 3University of Rome Tor Vergata | University of Twente, IT
Abstract
Racetrack memories (RTMs) have been shown to have lower leakage power and higher density compared to traditional DRAM/SRAM technologies. However, their efficiency is often hindered by the need to shift the targeted data to access ports for read and write operations. Suitable mapping approaches are therefore essential to unleash their potential. In this work, we explore the mapping of the popular tree-based document ranking algorithm, Quickscorer, onto Skyrmion-based racetrack memories (SK-RTMs). Our approach leverages a Logic-in-Memory (LiM) accelerator, specifically designed to execute simple logic operations directly within SK-RTMs, enabling an efficient mapping of Quickscorer by exploiting its bitvector representation and interleaved traversal scheme of tree structures through bitwise logical operations. We present several mapping strategies, including one based on a quadratic assignment problem (QAP) optimization algorithm for optimal data placement of Quickscorer onto the racetracks. Our results demonstrate a significant reduction in read and write operations and, in certain cases, a decrease in the time spent shifting data during Quickscorer inference.
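For readers unfamiliar with the bitvector traversal that Quickscorer reduces tree inference to, the toy sketch below shows the bitwise core for one hand-built 4-leaf tree; it does not model the skyrmion racetrack mapping, shift operations, or the QAP-based placement.

# Quickscorer-style exit-leaf lookup: AND the masks of all nodes whose test is
# false for the document; the lowest surviving bit identifies the exit leaf.
def exit_leaf(num_leaves, false_node_masks):
    bv = (1 << num_leaves) - 1                  # all leaves start reachable
    for mask in false_node_masks:
        bv &= mask                              # a false test cuts off its left subtree
    return (bv & -bv).bit_length() - 1          # index of the lowest set bit

# Tiny 4-leaf tree: root n0 (left subtree {0,1}), n1 (left subtree {0}),
# n2 (left subtree {2}); each mask zeroes the leaves of its left subtree.
masks = {"n0": 0b1100, "n1": 0b1110, "n2": 0b1011}

# A document for which n0 and n2 evaluate false and n1 evaluates true:
leaf = exit_leaf(4, [masks["n0"], masks["n2"]])
print(leaf)   # 3 -> the traversal goes right at the root, then right again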
17:20 CEST TS04.11 EF-IMR: EMBEDDED FLASH WITH INTERLACED MAGNETIC RECORDING TECHNOLOGY
Speaker:
Chenlin Ma, Shenzhen University, CN
Authors:
Chenlin Ma, Xiaochuan Zheng, Kaoyi Sun, Tianyu Wang and Yi Wang, Shenzhen University, CN
Abstract
Interlaced Magnetic Recording (IMR), a technology that improves storage density through track overlap, introduces significant latency due to Read-Modify-Write (RMW) operations. Writing to overlapped tracks affects underlying tracks, requiring additional I/O operations to read, back up, and rewrite them, resulting in significant head movement latency. We propose EF-IMR, a new architecture that ensures crash consistency in IMR while minimizing RMW latency and head movement. EF-IMR reduces head movement during RMW operations and decreases redundant RMW operations. Evaluations under real-world, intensive I/O workloads show that EF-IMR reduces RMW latency by 20.11% and head movement latency by 89.37% compared to existing methods.

TS05 System-level design methodologies and high-level synthesis

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS05.1 IMPROVING LLM-BASED VERILOG CODE GENERATION WITH DATA AUGMENTATION AND RL
Speaker:
Kyungjun Min, Pohang University of Science and Technology, KR
Authors:
Kyungjun Min, Seonghyeon Park, Hyeonwoo Park, Jinoh Cho and Seokhyeong Kang, Pohang University of Science and Technology, KR
Abstract
Large language models (LLMs) have recently attracted significant attention for their potential in Verilog code generation. However, existing LLM-based methods face several challenges, including data scarcity and the high computational cost of generating prompts for fine-tuning. Motivated by these challenges, we explore methods to augment training datasets, develop more efficient and effective prompts for fine-tuning, and implement training methods incorporating electronic design automation (EDA) tools. Our proposed framework for fine-tuning LLMs for Verilog code generation includes (1) abstract syntax tree (AST)-based data augmentation, (2) output-relevant code masking, a prompt generation method based on the logical structure of Verilog code, and (3) reinforcement learning with tool feedback (RLTF), a fine-tuning method using EDA tool results. Experimental studies confirm that our framework significantly improves syntax and functional correctness, outperforming commercial and non-commercial models on open-source benchmarks.
16:35 CEST TS05.2 SPARDR: ACCELERATING UNSTRUCTURED SPARSE DNN INFERENCE VIA DATAFLOW OPTIMIZATION
Speaker:
Wei Wang, Beihang University, CN
Authors:
Wei Wang, Hongxu Jiang, Runhua Zhang, Yongxiang Cao and Yaochen Han, Beihang University, CN
Abstract
Unstructured sparsity is becoming a key dimension in exploring the inference efficiency of neural networks. However, its irregular data layout makes it difficult to match the parallel computing mode of hardware, resulting in low computational and memory access efficiency. We studied this issue and found that existing sparse acceleration libraries and compilers explore sparse matrix multiplication optimizations through the splitting and reconstruction of sparse patterns, ignoring the acceleration of sparse convolution operations centered on data streams and thus missing optimization opportunities for sparse operations. In this article, we propose SparDR, a general sparse convolution acceleration method centered around data streams. Through novel feature map data stream reconstruction and convolutional kernel data representation, redundant zero-value calculations are effectively avoided, addressing efficiency is improved, and memory overhead is reduced. SparDR is based on TVM and allows for automatic scheduling across different hardware configurations. Compared with five current mainstream methods on four types of hardware, inference latency is accelerated by 1.1-12x and memory usage decreases by 20%.
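The zero-skipping principle behind sparse acceleration can be illustrated with a plain CSR sparse matrix-vector product, shown below; this only demonstrates skipping redundant zero-value work and does not reproduce SparDR's feature-map stream reconstruction or TVM-based scheduling.

# Minimal CSR sketch: store only non-zero weights and multiply only those.
import numpy as np

def to_csr(dense):
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def csr_matvec(values, col_idx, row_ptr, x):
    y = np.zeros(len(row_ptr) - 1)
    for r in range(len(y)):
        start, end = row_ptr[r], row_ptr[r + 1]
        y[r] = values[start:end] @ x[col_idx[start:end]]   # zeros never touched
    return y

w = np.array([[0.0, 2.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 3.0],
              [0.0, 0.0, 0.0, 0.0]])
x = np.array([1.0, 2.0, 3.0, 4.0])
vals, cols, ptr = to_csr(w)
print(csr_matvec(vals, cols, ptr, x))   # matches w @ x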
16:40 CEST TS05.3 AN IMITATION AUGMENTED REINFORCEMENT LEARNING FRAMEWORK FOR CGRA DESIGN SPACE EXPLORATION
Speaker:
Liangji Wu, Southeast University, Nanjing, Jiangsu Province, CN
Authors:
Liangji Wu, Shuaibo Huang, Ziqi Wang, Shiyang Wu, Yang Chen, Hao Yan and Longxing Shi, Southeast University, CN
Abstract
Coarse-Grained Reconfigurable Arrays (CGRAs) are a promising architecture that warrants thorough design space exploration (DSE). However, traditional DSE methods for CGRAs often get trapped in local optima due to singularities, i.e., invalid design points caused by CGRA mapping failures. In this paper, we propose a singularity-aware framework based on the integration of reinforcement learning (RL) and imitation learning (IL) for DSE of CGRAs. Our approach learns from both valid and invalid points, substantially reducing the probability of sampling singularities and accelerating the escape from inefficient regions, ultimately achieving high-quality Pareto points. Experimental results demonstrate that our framework improves the hypervolume (HV) of the Pareto front by 23.56% compared to state-of-the-art methods, with a comparable time overhead.
16:45 CEST TS05.4 OPERATION DEPENDENCY GRAPH-BASED SCHEDULING FOR HIGH-LEVEL SYNTHESIS
Speaker:
Aoxiang Qin, Sun Yat-sen University, CN
Authors:
Aoxiang Qin1, Minghua Shen1 and Nong Xiao2
1Sun Yat-sen University, CN; 2The School of Computer, Sun Yat-sen University, Panyue, CN
Abstract
Scheduling determines the execution order and time of operations in a program. The order is related to operation dependencies, including data and resource dependencies. Data dependencies are intrinsic to programs, while resource dependencies are determined by scheduling methods. Existing scheduling methods lack an accurate and complete operation dependency graph (ODG), leading to poor performance. In this paper, we propose an ODG-based scheduling method for HLS with GNN and RL. We adopt a GNN to perceive accurate relations between operations, use these relations to guide an RL agent in building a complete ODG, and perform feedback-guided iterative scheduling with the graph to converge to a high-quality solution. Experiments show that our method reduces latency by 23.8% and 16.4% on average, compared with the latest GNN-based and RL-based methods, respectively.
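As background for the scheduling problem, the toy list scheduler below processes an operation dependency graph under a simple functional-unit limit; the paper's GNN-derived relations and RL-built ODG are not modeled, and the unit budget is an arbitrary example value.

# Resource-constrained list scheduling over a dependency graph: every
# operation takes one cycle, at most units_per_cycle operations start per cycle.
def list_schedule(deps, units_per_cycle=2):
    """deps: op -> set of ops it depends on. Returns op -> start cycle."""
    remaining = {op: set(d) for op, d in deps.items()}
    schedule, cycle = {}, 0
    while remaining:
        ready = [op for op, d in remaining.items() if not d]
        for op in sorted(ready)[:units_per_cycle]:         # resource constraint
            schedule[op] = cycle
            del remaining[op]
            for d in remaining.values():
                d.discard(op)                               # release dependents
        cycle += 1
    return schedule

# a and b feed c; c and d feed e.
deps = {"a": set(), "b": set(), "c": {"a", "b"}, "d": set(), "e": {"c", "d"}}
print(list_schedule(deps))   # {'a': 0, 'b': 0, 'c': 1, 'd': 1, 'e': 2}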
16:50 CEST TS05.5 LOCALITY-AWARE DATA PLACEMENT FOR NUMA ARCHITECTURES: DATA DECOUPLING AND ASYNCHRONOUS REPLICATION
Speaker:
Shuhan Bai, Huazhong University of Science and Technology, CN
Authors:
Shuhan Bai, Haowen Luo, Burong Dong, Jian Zhou and Fei Wu, Huazhong University of Science and Technology, CN
Abstract
Non-Uniform Memory Access (NUMA) architectures bring new opportunities and challenges to bridge the gap between computing power and memory performance. Their complex memory hierarchies feature non-uniform access performance, known as NUMA locality, indicating that data placement and access without NUMA-awareness significantly impact performance. Existing NUMA-aware solutions often prioritize fast local access but at the cost of heavy replication overhead, suffering a read-write performance tradeoff and limited scalability. To overcome these limitations, this paper presents Ladapa, a scalable and high-performance locality-aware data placement strategy. The key insight is decoupling data into metadata and data layers, allowing independent management with adaptive asynchronous replication for lower overhead. Additionally, Ladapa employs multi-level metadata management leveraging fast caches for efficient data location, further boosting performance. Experimental results show that Ladapa outperforms typical replication techniques by up to 27.37× in write performance and 1.63× in read performance.
16:55 CEST TS05.6 HAVEN: HALLUCINATION-MITIGATED LLM FOR VERILOG CODE GENERATION ALIGNED WITH HDL ENGINEERS
Speaker:
Yiyao Yang, Shanghai Jiao Tong University, CN
Authors:
Yiyao Yang1, Fu Teng2, Pengju Liu1, Mengnan Qi1, Chenyang Lv1, Ji Li3, Xuhong Zhang2 and Zhezhi He1
1Shanghai Jiao Tong University, CN; 2Zhejiang University, CN; 3Independent Researcher, CN
Abstract
Recently, the use of large language models (LLMs) for Verilog code generation has attracted great research interest to enable hardware design automation. However, previous works have shown a gap between the ability of LLMs and the practical demands of hardware description language (HDL) engineering. This gap includes differences in how engineers phrase questions and hallucinations in the generated code. To address these challenges, we introduce HaVen, a novel LLM framework designed to mitigate hallucinations and align Verilog code generation with the practices of HDL engineers. HaVen tackles hallucination issues by proposing a comprehensive taxonomy and employing a chain-of-thought (CoT) mechanism to translate symbolic modalities (e.g., truth tables, state diagrams, etc.) into accurate natural language descriptions. Furthermore, HaVen bridges this gap by using a data augmentation strategy that synthesizes high-quality instruction-code pairs matching real HDL engineering practices. Our experiments demonstrate that HaVen significantly improves the correctness of Verilog code generation, outperforming state-of-the-art LLM-based Verilog generation methods on the VerilogEval and RTLLM benchmarks. HaVen is publicly available at https://github.com/Intelligent-Computing-Research-Group/HaVen.
17:00 CEST TS05.7 ENABLING MEMORY-EFFICIENT ON-DEVICE LEARNING VIA DATASET CONDENSATION
Speaker:
Gelei Xu, University of Notre Dame, US
Authors:
Gelei Xu1, Ningzhi Tang1, Jun Xia1, Ruiyang Qin1, Wei Jin2 and Yiyu Shi1
1University of Notre Dame, US; 2Emory University, US
Abstract
Upon deployment to edge devices, it is often desirable for a model to further learn from streaming data to improve accuracy. However, learning from such data is challenging because it is typically unlabeled, non-independent and identically distributed (non-i.i.d), and only seen once, which can lead to potential catastrophic forgetting. A common strategy to mitigate this issue is to maintain a small data buffer on the edge device to select and retain the most representative data for rehearsal. However, the selection process leads to significant information loss since most data is either never stored or quickly discarded. This paper proposes a framework that addresses this issue by condensing incoming data into informative synthetic samples. Specifically, to effectively handle unlabeled incoming data, we propose a pseudo-labeling technique designed for on-device learning environments. We also develop a dataset condensation technique tailored for on-device learning scenarios, which is significantly faster compared to previous methods. To counteract the effects of noisy labels during the condensation process, we further utilize a feature discrimination objective to improve the purity of class data. Experimental results indicate substantial improvements over existing methods, especially under strict buffer limitations. For instance, with a buffer capacity of just one sample per class, our method achieves a 56.7% relative increase in accuracy compared to the best existing baseline on the CORe50 dataset.
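A heavily simplified sketch of the two ingredients mentioned above, pseudo-labeling of unlabeled streaming data and per-class condensation into a tiny buffer, is given below; the exponential moving average stands in for the paper's actual (much more elaborate) condensation procedure, and all sizes are arbitrary.

# Toy on-device buffer: label each incoming sample with the current model's
# argmax, then fold its features into one synthetic sample per class.
import numpy as np

class CondensedBuffer:
    def __init__(self, num_classes, feat_dim, momentum=0.9):
        self.synthetic = np.zeros((num_classes, feat_dim))   # one sample per class
        self.counts = np.zeros(num_classes, dtype=int)
        self.momentum = momentum

    def pseudo_label(self, model_logits):
        return int(np.argmax(model_logits))                  # confidence-free toy rule

    def condense(self, feature, pseudo_label):
        c = pseudo_label
        if self.counts[c] == 0:
            self.synthetic[c] = feature                      # first sample seeds the slot
        else:
            self.synthetic[c] = (self.momentum * self.synthetic[c]
                                 + (1 - self.momentum) * feature)
        self.counts[c] += 1

buf = CondensedBuffer(num_classes=3, feat_dim=4)
rng = np.random.default_rng(0)
for _ in range(20):
    feat = rng.normal(size=4)
    logits = rng.normal(size=3)                              # stand-in for model output
    buf.condense(feat, buf.pseudo_label(logits))
print(buf.synthetic)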
17:05 CEST TS05.8 TAICHI: EFFICIENT EXECUTION FOR MULTI-DNNS USING GRAPH-BASED SCHEDULING
Speaker:
Xilang Zhou, Fudan University, CN
Authors:
Xilang Zhou, Haodong Lu, Tianchen Wang, Zhuoheng Wan, Jianli Chen, Jun Yu and Kun Wang, Fudan University, CN
Abstract
Deep Neural Networks (DNNs) are increasingly used for complex tasks (e.g., AR/VR) by constructing different types of DNNs into a workflow. However, efficient frameworks are lacking for accelerating these applications, which have complex connectivity and require real-time processing. We introduce ReFA, an FPGA-based co-design framework for the acceleration of real-time multi-DNN workloads. Specifically, on the hardware level, we develop an FPGA-based multi-core accelerator, which adopts a unified template for various DNN models and supports depth-first execution to reduce data movement. On the software level, we design a lightweight scheduler based on a genetic algorithm, which can rapidly find high-quality scheduling strategies in a huge solution space. Our evaluations show that ReFA deployed on a Xilinx Alveo U200 achieves up to 10.1-37.3× and 1.4-1.5× reduction in job completion time (JCT) compared with CPU and GPU, respectively. Furthermore, ReFA gains 6.1-9.3×, 7.9×, 5.6-7.1×, and 2.4× reduction in energy-delay product compared with GPU, Planaria, Herald and H3M, respectively.
17:10 CEST TS05.9 VTOT: AUTOMATIC VERILOG GENERATION VIA LLMS WITH TREE OF THOUGHTS PROMPTING
Speaker:
Xiangyu Wang, National University of Defense Technology, CN
Authors:
Yingjie Zhou1, Renzhi Chen2, Xinyu Li1, Jingkai Wang1, Zhigang Fang1, Bowei Wang1, Wenqiang Bai1, Qilin Cao1 and Lei Wang3
1National University of Defense Technology, CN; 2Qiyuan Laboratory, CN; 3Academy of Military Sciences, CN
Abstract
The automatic generation of Verilog code using Large Language Models (LLMs) presents a compelling solution for enhancing the efficiency of the hardware design flow. However, the state-of-the-art performance of LLMs in Verilog generation remains limited when compared to programming languages such as Python. Previous research on Chain of Thought (CoT) prompting has demonstrated that incorporating intermediate reasoning steps can significantly improve the performance of LLMs in code generation. In this paper, we propose the Verilog Tree of Thoughts (VToT) method. This structured prompting technique addresses the abstraction gap between Verilog and CoT by embedding hierarchical design constraints within the prompt. Experimental results on the VerilogEval and RTLLM benchmarks demonstrate that VToT prompting enhances both the syntactic and functional correctness of the generated code. Specifically, under the RTLLM benchmark, VToT achieved a correctness rate of 75.9% at pass@5, representing an improvement of 10.4%. Furthermore, in the VerilogEval benchmark, VToT achieved state-of-the-art performance with a correctness rate of 52.4% at pass@1 (an increase of 8.9%) and 65.4% at pass@5 (an increase of 9.6%).
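The search skeleton of tree-of-thoughts style prompting can be pictured with the toy loop below; llm() and score_candidate() are hypothetical placeholders (a real setup would query an actual LLM and a Verilog linter or testbench), and nothing here reproduces VToT's hierarchical design constraints.

# Generic expand-score-keep-best loop over "thoughts"; both helpers are stubs.
def llm(prompt, n=2):
    # Placeholder: pretend the model proposes n refinements of the prompt.
    return [f"{prompt} -> candidate {i}" for i in range(n)]

def score_candidate(candidate):
    # Placeholder: a real scorer might run a Verilog linter or a testbench.
    return len(candidate) % 7

def tree_of_thoughts(spec, depth=3, beam=2):
    frontier = [spec]                       # partial design "thoughts"
    for _ in range(depth):
        expansions = [c for thought in frontier for c in llm(thought)]
        expansions.sort(key=score_candidate, reverse=True)
        frontier = expansions[:beam]        # keep only the best-scoring thoughts
    return frontier[0]

print(tree_of_thoughts("8-bit counter with synchronous reset"))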
17:11 CEST TS05.10 SIGNAL PREDICTION FOR DIGITAL CIRCUITS BY SIGMOIDAL APPROXIMATIONS USING NEURAL NETWORKS
Speaker:
Josef Salzmann, TU Wien, AT
Authors:
Josef Salzmann and Ulrich Schmid, TU Wien, AT
Abstract
Investigating the temporal behavior of digital circuits is a crucial step in system design, usually done via analog or digital simulation. Analog simulators like SPICE iteratively solve the differential equations characterizing the circuits' components numerically. Although unrivaled in accuracy, this is only feasible for small designs, due to the high computational effort even for short signal traces. Digital simulators use digital abstractions for predicting the timing behavior of a circuit. We advocate a novel approach, which generalizes digital traces to traces consisting of sigmoids, each parameterized by threshold crossing time and slope. For a given gate, we use an artificial neural network for implementing the transfer function that predicts, for any trace of input sigmoids, the parameters of the generated output sigmoids. By means of a prototype simulator, which can handle circuits consisting of inverters and NOR gates, we demonstrate that our approach operates substantially faster than an analog simulator, while offering a much better accuracy than a digital simulator.
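The sigmoid parameterization described above, and the idea of a neural network acting as a gate transfer function, can be sketched as follows; the "gate" is a synthetic toy model (delayed crossing, damped slope) rather than data from a characterized inverter or NOR gate, and the network size is arbitrary.

# A sigmoid trace is fully described by its threshold-crossing time and slope;
# a small MLP is fitted to map input-sigmoid parameters to output-sigmoid ones.
import numpy as np
from sklearn.neural_network import MLPRegressor

def sigmoid_trace(t, t_cross, slope):
    return 1.0 / (1.0 + np.exp(-slope * (t - t_cross)))

# Synthetic characterization data: (input t_cross, input slope) -> output params.
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(0, 5, 500),         # input crossing times
                     rng.uniform(1, 10, 500)])       # input slopes
y = np.column_stack([X[:, 0] + 0.3 + 0.5 / X[:, 1],  # toy delay model
                     0.8 * X[:, 1]])                 # toy slope attenuation

model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(X, y)

t = np.linspace(0, 8, 9)
t_out, k_out = model.predict([[2.0, 4.0]])[0]        # predicted output parameters
print(np.round(sigmoid_trace(t, t_out, k_out), 3))   # reconstructed output waveform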
17:12 CEST TS05.11 VERILUA: AN OPEN SOURCE VERSATILE FRAMEWORK FOR EFFICIENT HARDWARE VERIFICATION AND ANALYSIS USING LUAJIT
Speaker:
Chuyu Zheng, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China, CN
Authors:
Ye Cai1, Chuyu Zheng1, Wei He2 and Dan Tang3
1Shenzhen University, CN; 2Beijing Institute of Open Source Chip, CN; 3Institute of Computing Technology, Chinese Academy of Sciences (ICT) / Beijing Institute of Open Source Chip, CN
Abstract
The growing complexity of hardware verification highlights limitations in existing frameworks, particularly regarding flexibility and reusability. Current methodologies often require multiple specialized environments for functional verification, waveform analysis, and simulation, leading to toolchain fragmentation and inefficient code reuse. This paper presents Verilua, a unified framework leveraging LuaJIT and the Verilog Procedural Interface (VPI), which integrates three core functionalities: Lua-based functional verification, a scripting engine for RTL simulation, and waveform analysis. By enabling complete code reuse through a unified Lua codebase, the framework achieves a 12× speedup in RTL simulation compared to cocotb and a 70× improvement in waveform analysis over state-of-the-art solutions. Through consolidating verification tasks into a single platform, Verilua enhances efficiency while reducing tool fragmentation and learning overhead, addressing critical challenges in modern hardware design.

CFI-SD Career Fair - Industry : Speed dating

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 17:30 CEST - 18:30 CEST


PhDF PhD forum

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 18:30 CEST - 20:00 CEST

Session chair:
Christian Pilato, Politecnico di Milano, IT

Session co-chair:
Dirk Stroobandt, Ghent University, BE

The PhD Forum is a great opportunity for PhD students to present their work to a broad audience in the system design and design automation community from both industry and academia, and to establish contacts for entering the job market. Representatives from industry and academia get a glance of state-of-the-art in system design and design automation. The PhD Forum is hosted by EDAA, ACM SIGDA, and IEEE CEDA.

Time Label Presentation Title
Authors
18:30 CEST PhDF.1 ADAPTIVE HARDWARE FOR ENERGY-EFFICIENT FPGA-BASED DATA CENTERS
Speaker and Author:
Mattia Tibaldi, Politecnico di Milano, IT
Abstract
Modern applications require the elaboration of massive amounts of data. Due to the computational power needed, such applications may execute in data centers that consume immense energy. In 2020, data centers contributed 2% of the world's carbon emissions, with an increasing trend. Google recently announced the development of small nuclear reactors to power larger data centers with zero carbon emissions. Although this is a possible direction in the development of sustainable data centers, the solution may not always be applicable, as several states impose restrictions on the use of nuclear energy. For this reason, the study of hardware and software solutions remains important, and designers must guarantee a high quality of result while efficiently managing the energy required by the computation to reduce costs and carbon production. Many data centers are moving towards heterogeneous architectures equipped with specialized hardware to achieve high performance and power savings. Through customization, these architectures can significantly reduce energy consumption, while hardware parallelism can optimize the execution time. However, such components have limited flexibility: once designed, they cannot execute the functionality differently, and their energy consumption is fixed by the implementation of the architecture. This research proposes implementing an adaptive system based on FPGAs to guarantee flexibility, develop different versions of a computation component, and select the version at run time. In this way, based on stimuli coming from the environment, such as the intensity of the incoming traffic or the data formats, it will be possible to use logic with different energy profiles. This approach allows us to design an accelerator with 25× the power efficiency of a CPU and a 40% reduction in carbon emissions.
18:30 CEST PhDF.2 SAFETY CONCEPT AND SIMULATION-BASED APPROVAL OF AN AUTOMATED DRIVING FUNCTION FOR THE TRANSVERSE GUIDANCE OF VEHICLES
Speaker and Author:
Marzana Khatun, University of Ulm, DE
Abstract
The growing interest in and demand for automated driving systems have turned self-driving vehicles from science fiction into practical reality. However, automated vehicles (AVs) face the critical challenge of outperforming human drivers and have yet to gain the public's trust. Establishing safety concepts in advance of the development phases is essential for the reliability of automated driving systems. Emerging safety concepts emphasize rigorous proofs, development processes and applicable methods to ensure the safety of such systems. These concepts take into account the inherent complexity of automated technologies and the need for continuous improvement, incorporating both new and refined technologies and methods. This work proposes a general safety concept through: (a) scenario-based extended Hazard Analysis and Risk Assessment (HARA) for the transverse guidance of a vehicle, used as a reference for describing safety-related threat scenarios for vehicle functions such as Over-The-Air (OTA) updates, (b) scenario reduction approaches to support simulation-based approval for collision detection use cases focusing on L3 or higher levels of automation (L4/L5), and (c) evaluation of management system interaction for autonomous vehicles.
18:30 CEST PhDF.3 DETECTION AND REPAIR OF DEFECTS IN RTL CODE USING STATIC ANALYSIS AND GENERATIVE AI
Presenter:
Baleegh Ahmad, New York University, US
Author:
Baleegh Ahmad, New York University, US
Abstract
Problems associated with hardware bugs have gained increasing importance over the past decade. In particular, detecting and fixing security bugs has been the focus of many academic and industrial efforts. It is crucial to detect defects in hardware as early as possible to reduce costs, effort and damage to reputation down the line. Existing techniques lack the breadth to detect a variety of defects and the scalability to apply generalizable solutions over many digital designs. Both these deficiencies can be addressed by developing solutions at an earlier stage of the system-on-chip (SoC) development life-cycle. This work provides strategies to apply bug detection and repair techniques without needing a full-fledged testing framework. These strategies are employed at the register-transfer level (RTL) by looking at the structure and elements of the code, information around the code such as specifications and comments, and more general guidelines for secure code such as the Common Weakness Enumerations (CWEs). One of the main research gaps is that existing solutions for bug detection and repair in RTL rely on design-specific information and techniques, which makes them not generalizable. Additionally, security verification is by its nature non-exhaustive: some vulnerabilities are only understood once they are exploited, so there is no way of knowing what to look for beforehand. Another limitation of current approaches is that the solutions do not have the ability to 'learn' from previous issues and solutions. Our work aims to address these limitations by i) improving generalizability by moving away from design-specific frameworks and implementing scanners that utilize a broad range of vulnerabilities, i.e., CWEs, ii) focusing on security-related bugs to produce security-aware linters and fixing security bugs using LLMs, and iii) using the ability of LLMs to detect and repair bugs at RTL, showing how generative AI-based tools can succeed by working out solutions from what they learned during training, fine-tuning and context-based learning.
18:30 CEST PhDF.4 LOW-POWER TIME-DOMAIN HARDWARE ACCELERATOR FOR EDGE COMPUTING
Speaker:
Jie Lou, RWTH Aachen University, DE
Authors:
Jie Lou and Tobias Gemmeke, RWTH Aachen University, DE
Abstract
Efficient computing is becoming increasingly crucial for energy-constrained edge devices. With the rapid adoption of artificial neural networks (ANNs), reducing energy consumption has emerged as a pressing research challenge to enable effective edge computing. Time-domain (TD) computing has attracted attention for its inherent analog signal processing properties and compatibility with digital circuits. Unfortunately, it remains unclear which scenarios are best suited for computation in the time domain. This thesis focuses on developing hardware accelerators for time-domain computing, analyzing suitable application domains, and identifying the principles and constraints that should be considered during ASIC implementation. Specifically, we design, tape out and measure both standard and custom cell-based time-domain compute-in-memory (TDCIM) accelerators for binary neural networks (BNNs) and convolutional neural networks (CNNs), as well as a standard cell-based TD decoder for low-density parity-check (LDPC) codes, using 22nm FDSOI technology to validate the performance of time-domain computing. Besides, we also develop a software simulation framework that accounts for hardware TD noise in neuromorphic and LDPC applications.
18:30 CEST PhDF.5 SIMULATION TECHNIQUES FOR RAPID SOFTWARE DEVELOPMENT AND VALIDATION
Speaker:
Mohammadreza Amel Solouki, Politecnico di Torino, IT
Authors:
Mohammadreza Amel Solouki and Massimo Violante, Politecnico di Torino, IT
Abstract
Ensuring reliability under Random Hardware Failures (RHFs) in safety-critical embedded systems requires robust fault tolerance measures. My research proposes innovative methods for enhancing fault detection and mitigation through Control Flow Checking (CFC) and Software-Implemented Hardware Fault Tolerance (SIHFT) techniques. Hardening strategies are often applied in embedded systems to mitigate RHFs, either by using specialized hardware or employing SIHFT methods. However, most existing approaches in the literature target soft errors and are implemented in low-level languages such as Assembly. This complicates compliance with functional safety standards, which increasingly advocate for high-level programming languages like C. Addressing these challenges, my research focuses on integrating SIHFT methods directly into high-level programming workflows to streamline fault detection and mitigation processes while adhering to industry standards like ISO 26262. This study tackles the practical challenges of implementing fault tolerance in embedded systems, bridging the gap between theoretical models and real-world applications. By leveraging high-level programming languages and adhering to international safety standards like ISO 26262, this work advances the state of the art in embedded system reliability and lays a foundation for future developments in fault-tolerant computing. High-level language implementations simplify adherence to ISO 26262 by enabling better traceability, easier maintenance, and compliance with mandated software development workflows.
18:30 CEST PhDF.6 SOFTWARE AND HARDWARE CO-OPTIMIZATION FOR GRAPH NEURAL NETWORKS ON FPGA
Presenter:
Ruiqi Chen, Vrije Universiteit Brussel, BE
Authors:
Ruiqi Chen1, Kun Wang2 and Bruno da Silva1
1Vrije Universiteit Brussel, BE; 2Fudan University,
Abstract
Deep Neural Networks (DNNs) are proliferating in numerous AI applications, thanks to their high accuracy. For instance, Convolution Neural Networks (CNNs), one variety of DNNs, are used in object detection for autonomous driving and have reached or exceeded the performance of humans in some object detection problems. Commonly adopted CNNs such as ResNet and MobileNet are becoming deeper (more layers) while narrower (smaller feature maps) than early AlexNet and VGG. Nevertheless, due to the race for better accuracy, the scaling up of DNN models, especially Transformers (another variety of DNNs), to trillions of parameters and trillions of Multiply-Accumulate (MAC) operations, as in the case of GPT-4, during both training and inference, has made DNN models both data-intensive and compute-intensive, placing heavier demands on memory capacity for storing weights and on computation. This poses a significant challenge for the deployment of these models in an area-efficient and power-efficient manner. Given these challenges, model compression is a vital research topic to alleviate the crucial difficulties of memory capacity from the algorithmic perspective. Pruning, quantization, and entropy coding are three directions of model compression for DNNs. The effectiveness of pruning and quantization can be enhanced with entropy coding for further model compression. Entropy coding focuses on encoding the quantized values of weights or features in a more compact representation by utilizing the peaky distribution of the quantized values, to achieve a lower number of bits per variable without any accuracy loss. Currently employed Fixed-to-Variable (F2V) entropy coding schemes such as Huffman coding and Arithmetic coding are inefficient to decode on hardware platforms, suffering from a high decoding complexity of O(n · k), where n is the number of codewords (quantized values) and k is the reciprocal of the compression ratio.

18:30 CEST PhDF.7 SEMI-TENSOR PRODUCT OF MATRICES AND ITS APPLICATION IN LOGIC SYNTHESIS
Speaker:
Hongyang Pan, Fudan University, CN
Authors:
Hongyang Pan1, Zhufei Chu2 and Fan Yang1
1Fudan University, CN; 2Ningbo University, CN
Abstract
In recent years, a new theory called the semi-tensor product (STP) has emerged [8]. By studying the topological structure of Boolean networks, STP transforms logical dynamic systems into discrete dynamic systems, thereby realizing logical reasoning through matrix multiplication. The STP approach offers a promising way to address the current challenges of logic synthesis. By defining the logic matrix as a circuit primitive, STP converts the logic network into matrix multiplication while retaining the topological information between circuits. The STP method can represent any network, including technology-independent representations and technology-dependent ones (standard cells or LUTs), thereby unifying circuit representation.
18:30 CEST PhDF.8 EXPLORING LAYER-FUSED MAPPING OF DNNS ON HETEROGENEOUS DATAFLOW ACCELERATORS
Speaker:
Arne Symons, KU Leuven, BE
Authors:
Arne Symons1 and Marian Verhelst2
1MICAS, KU Leuven, BE; 2KU Leuven, BE
Abstract
The rapid advancements in deep neural networks (DNNs) have led to increased computational complexity, memory demands, and energy consumption, posing significant challenges for edge applications. Heterogeneous dataflow accelerators (HDAs), leveraging multi-core and chiplet-based architectures, offer specialized processing for diverse DNN workloads. However, traditional layer-by-layer scheduling strategies often result in high off-chip memory traffic and underutilized cores. This work introduces Stream, a design space exploration framework for optimizing layer-fused DNN mappings on HDAs. Layer fusion, or depth-first scheduling, minimizes off-chip data transfers and enhances core utilization by processing outputs through a stack of fused layers. Stream integrates fine-grained dependency modeling, memory- and communication-aware performance analysis, and a constraint-based optimization engine to deliver significant improvements in latency and energy efficiency. Stream's efficacy is validated on state-of-the-art architectures, achieving >95% accuracy in latency modeling for accelerators like DepFiN and DIANA. Comparative studies show up to 2.2× energy-delay product (EDP) improvements for activation-dominant workloads like MobileNetV2 under layer fusion. Additionally, scalability studies highlight Stream's adaptability to hardware configurations, demonstrating optimal core distributions as processing element budgets increase. All methodologies and results are open-source, enabling further innovation and adoption: https://github.com/kuleuven-micas/stream.
18:30 CEST PhDF.9 INVESTIGATING SECURITY ISSUES IN PROGRAMMABLE LOGIC CONTROLLERS AND RELATED PROTOCOLS
Speaker and Author:
Wael Alsabbagh, IHP – Leibniz Institute for High Performance Microelectronics, DE
Abstract
Programmable Logic Controllers (PLCs) play a substantial role in Critical Infrastructures (CIs) and Industrial Control Systems (ICSs). They are programmed with a control logic program that determines how to control and operate physical processes such as nuclear power plants, petrochemical factories, water treatment systems, and many others. Unfortunately, these devices are not fully secured and remain vulnerable to malicious attacks, particularly those targeting the control logic of PLCs. Such threats, known as control logic injection attacks, are designed to manipulate industrial processes, potentially causing catastrophic damages, as exemplified by the Stuxnet attack [1]. This thesis investigates various security issues and vulnerabilities associated with PLCs and their communication protocols, with a primary focus on control logic injection attacks. Our objective is to analyze the security mechanisms of both non-cryptographically and cryptographically protected PLCs, assessing the effectiveness of vendor-implemented safeguards. Siemens PLCs were selected for experimentation due to their widespread use in industrial environments. Figure 1 illustrates the methodology employed in this study, outlining our research workflow and experimental steps.
18:30 CEST PhDF.10 SECURE AND SCALABLE HARDWARE FOR POST-QUANTUM CRYPTOGRAPHY AND FULLY HOMOMORPHIC ENCRYPTION
Speaker and Author:
Aikata Aikata, TU Graz, AT
Abstract
Secure communication and privacy-preserving computation are the cornerstones of modern-day digital interactions. As the world becomes more interconnected, ensuring the confidentiality and integrity of both communication and computation is essential to safeguarding sensitive data and maintaining trust in digital systems. With the advent of quantum computing, traditional public key cryptographic schemes face obsolescence, making it crucial to develop new technologies that can uphold these pillars in a quantum-enabled future. Thus, this thesis focuses on secure communication and privacy-preserving computation through advancements in Post-Quantum Cryptography (PQC) and Fully Homomorphic Encryption (FHE). Efficient, compact, and secure hardware architectures for PQC are developed. Key contributions include the first unified hardware designs for NIST-standardized Digital Signature (CRYSTALS-Dilithium) and Key Encapsulation (CRYSTALS-Kyber), compact and resource-efficient implementations, agile designs that accommodate future algorithmic changes, and defence techniques against side-channel attacks (via masking). The research also addresses the challenges of cost-effective hardware acceleration for FHE to enable efficient privacy-preserving computation. In this direction, a major highlight is the pioneering scalable multi-chiplet architectures that achieve significant performance gains while reducing fabrication costs by 50%. The thesis also introduces the first hardware implementation of a Hybrid Homomorphic Encryption (HHE) scheme, Pasta, achieving a 97x speedup over existing solutions. Furthermore, new fault analysis techniques have been developed to emphasize the need for continued security research. The work further optimizes FHE applications for privacy-preserving neural network evaluations. To summarize, this thesis develops secure and scalable hardware solutions for advanced cryptographic techniques, PQC and FHE, which play a key role in the adoption of digital security and privacy in the post-quantum era. Several proposed works have also been open-sourced to foster further innovation in this domain.
18:30 CEST PhDF.11 LEARNING-BASED METHODS FOR ENABLING ON-EDGE, ACCURATE, SUSTAINABLE, AND HUMAN-CENTERED INTELLIGENT MANUFACTURING
Speaker:
Luigi Capogrosso, Università di Verona, IT
Authors:
Luigi Capogrosso, Marco Cristani and Franco Fummi, Università di Verona, IT
Abstract
Four major evolutions of industrialization have occurred throughout human history, impacting economic growth, population expansion, and significant social transformations. Industry 5.0 is regarded as the next industrial revolution, and its objective is to leverage the creativity of human experts in collaboration with efficient, accurate, and intelligent machines. In this context, the transformation of industrial resources into intelligent objects capable of sensing, acting, and adapting leads to intelligent manufacturing. To comprehensively enhance manufacturing systems capabilities, this thesis presents cutting-edge learning-based techniques around four key pillars of intelligent manufacturing: efficient edge-cloud computing, accurate anomaly detection, sustainability, and human-centered systems design. The results obtained are shown in Figure 1, which presents the real-world setup of the Industrial Computer Engineering (ICE) Laboratory of the University of Verona, where the presented contributions were tested and evaluated.
18:30 CEST PhDF.12 HIGH-DENSITY AND RELIABLE COMPUTE-IN-MEMORY CIRCUITS AND ARCHITECTURES FOR BIG DATA PROCESSING
Speaker:
Hongtao Zhong, Department of Electronic Engineering, Tsinghua University, CN
Authors:
Hongtao Zhong and Xueqing Li, Tsinghua University, CN
Abstract
Recent big data applications require processing massive data within limited time or power budgets, but the conventional Von Neumann architecture suffers from the "memory wall" challenge. Computing-in-Memory (CIM) technology is an emerging computing paradigm and is promising for overcoming this challenge. Besides, content addressable memory (CAM), originally used in caches and routing, has been shown to be capable of in-memory searching that can accelerate many search applications. Although existing CiM/CAM designs have shown significant energy efficiency improvements, low memory density and low reliability are still two big obstacles on the path towards "real computing/searching in memory". To address these challenges, we propose a series of cross-layer explorations from devices to architectures and applications, including the following three parts: i) high-density and low-power eDRAM memory and CiM circuits based on NEM relays and AFeFETs with low refresh overhead; ii) energy-efficient and reliable charge-domain CiMs and CAMs with high memory density thanks to the proposed dense cell and cluster design; iii) high-density 3D-memory-based domain-specific architecture (DSA) designs with a series of algorithm-architecture optimizations that achieve high end-to-end speedup and high energy efficiency. These works push the frontiers towards higher density, higher reliability, and more practical CiM/CAM designs.
18:30 CEST PhDF.13 HARDWARE RELIABILITY ASSESSMENT AND ENHANCEMENT FOR DEEP NEURAL NETWORKS
Speaker and Author:
Mohammad Hasan Ahmadilivani, Tallinn University of Technology, EE
Abstract
Due to the high capabilities of DNNs in solving various tasks, they are widely adopted in safety-critical applications such as automotive, space, and healthcare. A major concern in designing a system for such use cases is hardware reliability. To address hardware reliability concerns in DNN deployment, their fault resilience should first be assessed and then enhanced. With the growth of DNN exploitation, the size of emerging DNNs in terms of the number of parameters and computations is rapidly rising. This poses a huge complexity challenge for their reliability assessment and enhancement, necessitating efficient and innovative solutions to reduce complexity and overheads. In this thesis, some of the most significant challenges of reliability assessment and enhancement for DNNs are identified and addressed to enable the exploitation of DNNs in safety-critical applications. My thesis presents the first Systematic Literature Review (SLR) focused exclusively on methods of reliability assessment for DNNs, exploring these methods, classifying them, and identifying the existing gaps and challenges in the field. For reliability assessment, it addresses the scalability problem for the first time by introducing a novel semi-analytical and metric-oriented method. Moreover, this thesis introduces multiple cost-effective fault-tolerant techniques for DNNs, applicable to a wide range of DNN accelerators. Many methods in this thesis are open-source to enable researchers and engineers in this field to quickly evaluate DNNs' reliability and design fault-tolerant DNNs.
18:30 CEST PhDF.14 TOWARD RELIABLE AI ACCELERATORS
Presenter:
Eleonora Vacca, Politecnico di Torino, IT
Authors:
Eleonora Vacca and Luca Sterpone, Politecnico di Torino, IT
Abstract
Deploying deep neural networks (DNNs) in safety-critical systems, such as autonomous vehicles and medical diagnostics, demands high performance and reliability. Traditional approaches to enhance reliability, such as hardware redundancy, impose significant computational and energy overheads, making them unsuitable for practical use, especially in large-scale or resource-constrained systems. This research proposes a novel hardware-software co-design strategy to improve the reliability of Systolic Array (SA) accelerators, a key component for efficient DNN computation. The approach introduces error self-detection mechanisms that fully utilize the existing functional paths of the accelerator, eliminating the need for additional hardware. Furthermore, zero-overhead algorithmic techniques are developed to mitigate faults by leveraging insights into fault propagation and system behavior. These innovations enhance the fault tolerance of SA accelerators without increasing computational, memory, or energy costs, providing a scalable solution for reliable DNN performance in critical applications.
18:30 CEST PhDF.15 DESIGN AND SIMULATION OF ATOMIC-SCALE COMPUTING: BRIDGING COMPUTER SCIENCE, ELECTRICAL ENGINEERING, AND PHYSICS
Speaker:
Jan Drewniok, TU Munich, DE
Authors:
Jan Drewniok and Robert Wille, TU Munich, DE
Abstract
As AI and digital transformation accelerate, traditional computing architectures struggle to meet the soaring demand for energy efficiency. Silicon Dangling Bond (SiDB) technology stands out as a post-CMOS candidate, offering robust, scalable, and energy-efficient computing. However, despite its immense potential to revolutionize atomic-scale computing, the progress of SiDB technology has been hindered by a lack of interdisciplinary collaboration among computer scientists, electrical engineers, and physicists. This disconnect, caused by the absence of shared design rules and software tools to enforce interdisciplinary requirements, has limited the integration of hardware designs with computational strategies. To address these challenges, this thesis introduces a comprehensive framework for the SiDB technology that bridges these interdisciplinary gaps. Key contributions include the development of highly efficient physical simulators, achieving runtime improvements of up to a factor of 5000, and SiDB logic design algorithms with a runtime improvement of up to a factor of 63. Moreover, the thesis proposes the establishment of design rules (such as temperature behavior, defect analysis, and operational domain exploration) together with efficient algorithms to determine them for the first time for the SiDB technology. These advancements enable the automatic design of realistic and robust SiDB circuits, paving the way for real-world applications. In an effort to support open research and reproducibility, all aforementioned methodologies have been implemented into open-source tools and made publicly available on GitHub and PyPI. By bridging disciplines through this comprehensive framework, the thesis positions the SiDB technology as a viable and sustainable solution to address the escalating computational and energy demands of the future.
18:30 CEST PhDF.16 LEARNING-BASED ANALOG ICS LAYOUT AUTOMATION
Speaker and Author:
Davide Basso, University of Trieste, IT
Abstract
Analog integrated circuit layout has always been a challenging task, requiring sophisticated manual expertise to achieve optimal results. This thesis proposes a novel approach to streamline and accelerate this procedure by leveraging machine learning techniques. Specifically, we utilize a reinforcement learning agent to sequentially place devices on a chip canvas. The placement process is complemented by a Steiner tree-based global routing algorithm for driving connectivity. To enhance generalization capabilities, our pipeline uses a graph neural network, ensuring robust performance across various layout scenarios. This innovative approach is seamlessly integrated into Infineon's procedural layout generator, enabling users to maintain high-quality standards while significantly reducing manual effort. Experimental results demonstrate the efficiency of our method, reducing complete layout generation runtimes to 67.3% of those of traditional manual techniques.
18:30 CEST PhDF.17 PHYSICAL DESIGN FOR FIELD-COUPLED NANOCOMPUTING
Speaker:
Simon Hofmann, Chair for Design Automation, TU Munich, DE
Authors:
Simon Hofmann and Robert Wille, TU Munich, DE
Abstract
The growing demand for computational power, coupled with the limitations of Moore's Law and rising energy consumption of CMOS technologies, necessitates alternative computing paradigms. Field-coupled Nanocomputing (FCN) offers a promising solution by utilizing the repulsion of physical fields instead of electrical current for ultra-low-power computation at the nanoscale. Recent advances, such as sub-30 nm OR gates using Silicon Dangling Bonds (SiDBs), have renewed interest in FCN. However, constraints like planarity requirements, complex clocking schemes, and the need for signal synchronization pose significant challenges in physical design. This thesis addresses these challenges by developing novel physical design algorithms and tools to enhance the efficiency and scalability of FCN circuit design. We introduce NanoPlaceR, a reinforcement learning-based tool that reduces layout area by 50% compared to prior methods. Building upon this, we present gold, an algorithm that further reduces area overhead by 24% and accelerates the design process by 460 times. To enable cross-technology compatibility, we develop an algorithm that transforms layouts between Cartesian grids (used in Quantum-dot Cellular Automata) and hexagonal grids (required by SiDB gates), bridging different FCN technologies without extensive redevelopment. Furthermore, we propose post-layout optimization and wiring reduction techniques tailored to FCN, achieving additional area savings. We also introduce MNT Bench, a comprehensive benchmark suite providing gate-level layouts and network descriptions, and implement all methodologies in open-source tools within the Munich Nanotech Toolkit (MNT), promoting reproducibility and collaboration in FCN design automation. These contributions advance the state-of-the-art in FCN physical design, providing scalable and efficient solutions crucial for realizing FCN technologies in the post-CMOS era.
18:30 CEST PhDF.18 TOWARDS SOUND AND COMPLETE ANALYSIS OF INTEGRATED CIRCUITS AT TRANSISTOR-LEVEL
Speaker:
Oussama Oulkaid, Université Grenoble Alpes, FR
Authors:
Oussama Oulkaid1, Matthieu Moy2, Pascal Raymond3, Bruno Ferres4 and Mehdi Khosravian5
1University Lyon, EnsL, UCBL, CNRS, Inria, LIP, F-69342, LYON Cedex 07, France - University Grenoble Alpes, CNRS, Grenoble INP, VERIMAG, 38000 Grenoble, France - Aniah, 38000 Grenoble, France, FR; 2University Lyon, EnsL, UCBL, CNRS, Inria, LIP, F-69342, LYON Cedex 07, France, FR; 3University Grenoble Alpes, CNRS, Grenoble INP, VERIMAG, 38000 Grenoble, France, FR; 4University Grenoble Alpes, CNRS, Grenoble INP, VERIMAG, 38000 Grenoble, FR; 5Aniah, FR
Abstract
Circuit verification is an undoubtedly complex task. It is both costly and time-consuming: for Application Specific Integrated Circuit (ASIC) designs, verification accounts for a median of 50–60% of the total project time. In this work, we focus on a specific aspect of circuit verification, namely the verification of electrical properties at the transistor level. We present transistor-level semantics, and we show how they can be used in the context of electrical verification. We demonstrate the use of our approach for missing level-shifter detection, and we present prospects for extending the work to a form of reliability analysis.
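As an illustration of the kind of electrical check mentioned in the abstract, the following Python sketch flags nets whose driver and receiver sit in different voltage domains with no level shifter in between; the tiny netlist model and cell attributes are assumptions made for this example, not the thesis's transistor-level semantics.

cells = {
    "u_core": {"supply": "VDD_0V8", "type": "logic"},
    "u_ls":   {"supply": "VDD_1V2", "type": "level_shifter"},
    "u_io":   {"supply": "VDD_1V2", "type": "logic"},
}
nets = [                                     # (driver cell, receiver cell)
    ("u_core", "u_ls"),                      # crosses domains via a level shifter: ok
    ("u_ls",   "u_io"),                      # same domain: ok
    ("u_core", "u_io"),                      # crosses domains with no shifter: violation
]

def missing_level_shifters(cells, nets):
    violations = []
    for driver, receiver in nets:
        crosses = cells[driver]["supply"] != cells[receiver]["supply"]
        shifted = "level_shifter" in (cells[driver]["type"], cells[receiver]["type"])
        if crosses and not shifted:
            violations.append((driver, receiver))
    return violations

print(missing_level_shifters(cells, nets))   # -> [('u_core', 'u_io')]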
18:30 CEST PhDF.19 FULL-STACK SYSTEM DESIGN AND PROTOTYPING FOR PRACTICAL PHOTONIC-ELECTRONIC NEUROCOMPUTING
Presenter:
Yinyi Liu, The Hong Kong University of Science and Technology, HK
Author:
Yinyi Liu, The Hong Kong University of Science and Technology, HK
Abstract
The proliferation of more intelligent neural-network models continuously demands higher computing performance. Despite the superior processing speed and energy efficiency of integrated photonic circuits, their full potential remains far from being realized due to the lack of mature and comprehensive full-stack ecosystem support. Existing works on system-level design often ignore low-level details such as memory transactions or scheduling peripherals external to the photonic chip. As a result, there is currently no available toolchain that enables the seamless and convenient migration of design simulation results to physical implementation. In this study, we propose a comprehensive solution that covers both the software and hardware stacks to address this gap. Our toolchain includes an MLIR-based compiler that translates neural applications described in Python into an ELF executable file, using a customizable RISC-V ISA with photonic instructions specifically designed to run on photonic cores. The resulting executable can then be utilized in functional simulation or transferred to our reconfigurable hardware template for agile physical verification of the design. We anticipate that researchers and developers will utilize photonic-electronic neurocomputing more effectively in real-world applications by leveraging our proposed toolchain.
18:30 CEST PhDF.20 DYNAMIC MEMORY MANAGEMENT OPTIMIZATIONS OVER HETEROGENEOUS MEMORY SYSTEMS
Presenter:
Manolis Katsaragakis, National TU Athens, GR
Authors:
Manolis Katsaragakis1, Francky Catthoor2 and Dimitrios Soudris1
1National TU Athens, GR; 2IMEC, BE
Abstract
This PhD focuses on the development of a systematic methodology for source code organization, data structure refinement, exploration and placement over emerging memory technologies. The goal is to extract alternative solutions that provide multi-criteria trade-offs across different optimization aspects, such as memory footprint, accesses, performance and energy consumption.
18:30 CEST PhDF.21 SYSTEM-LEVEL DESIGN IN THE ERA OF BRAIN-COMPUTER INTERFACES
Presenter:
Guy Eichler, Columbia University, US
Authors:
Guy Eichler and Luca Carloni, Columbia University, US
Abstract
Brain-computer interfaces (BCIs) emerged in the 1960s. Since then, BCI applications have focused on enabling a better understanding of the brain and on providing the foundations for machine learning models, and ultimately they promise to provide a direct channel between the brain and the outside world. With advancements in the fields of Internet-of-Things (IoT) and machine learning (ML) at the edge, self-contained BCI systems that can acquire neural signals from the brain, communicate wirelessly, and execute computational kernels to process neural data have recently transitioned to the forefront of research and development. Major efforts are currently being made to evolve BCIs from non-invasive, low-resolution, wearable devices into invasive, high-resolution, implanted systems-on-chip (SoCs). However, unlike typical devices, implant-based, self-contained BCI systems must operate under strict safety requirements due to the sensitivity of brain tissue to heat. Consequently, to date, not a single self-contained, implant-based BCI system has been successfully tested in vivo (on live subjects), proven safe, and made available to the public. Thus, I state that constructing pragmatic BCI systems necessitates a holistic, system-level approach that emphasizes specialized hardware design while accounting for the brain as an integral component of the BCI system. I support this statement by contributing along three parallel BCI timelines, which I define as follows: 1) Pre-BCI Era - Our current time. Designing implant-based BCI systems that support large-scale neural data acquisition and wireless communication, and setting the groundwork for real-time computation within the BCI system. The goal is to reach the point where we have a functional and safe, self-contained BCI system. 2) Intra-BCI Era - The near future. Assuming that BCI systems are available, I develop a methodology to support the integration of BCI applications into the BCI system through hardware accelerator design, design-space exploration, and utilization of the brain as a resource in the system. I design hardware accelerators for BCI algorithms and for brain-based random-number generation to support BCI applications in the BCI system. Furthermore, we would like to support the scalability of the system. 3) Post-BCI Era - The not-so-far future. Assuming that we have BCI systems that integrate computation, I integrate biologically inspired computation on specialized SoCs to support better interfacing between ML and biological neurons. The goal is for BCI systems to ultimately function as cognitive co-processors for the brain.
18:30 CEST PhDF.22 ML-BASED RESOURCE MANAGEMENT OF RECONFIGURABLE SYSTEMS IN THE CLOUD-EDGE CONTINUUM
Speaker:
Juan Encinas, Universidad Politécnica de Madrid, ES
Authors:
Juan Encinas1, Alfonso Rodríguez1 and Andres Otero2
1UPM, ES; 2Universidad Politecnica de Madrid, ES
Abstract
Field-Programmable Gate Arrays (FPGAs) are commonly used in the embedded domain because they provide better energy efficiency than a Graphics Processing Unit (GPU) and competitive performance compared to an Application-Specific Integrated Circuit (ASIC). Moreover, Dynamic and Partial Reconfiguration can be used to modify part of the implemented logic at run time without interfering with the rest of the system, providing outstanding flexibility. Traditionally, in reconfigurable embedded systems, the applications to be accelerated and the relationships between them are known at design time and, therefore, a design space exploration process is typically performed to decide when to reconfigure each accelerator to maximize performance and reduce power consumption. However, there are other scenarios where workloads (i.e., the arrival order of a diverse set of accelerators) are not known at design time. This is the case of computing offloading scenarios where the FPGAs are placed in the cloud-edge continuum and work as acceleration engines. In these scenarios, FPGAs usually deal with dynamic workloads where requirements vary on demand, requiring run-time decisions to keep the optimal operating point, preserving the expected Quality of Service and power constraints. In order to make more informed decisions, the hardware accelerators must be characterized in terms of power consumption and execution performance. Doing this analytically is unfeasible due to the large number of variables involved in a real scenario where multiple accelerators are executed simultaneously. In this thesis, models based on ML techniques are proposed as a mechanism to predict power consumption and performance in reconfigurable multi-accelerator systems, since ML algorithms are particularly good at finding these complex relationships between multiple factors. Specifically, an incremental modeling approach has been implemented to characterize upcoming workloads at run time, updating the prediction models with new observations. A smart scheduler is also included to make resource management decisions based on the predictions of the incremental ML-based models. In addition, complementary infrastructures are also proposed for managing the dynamic workloads in FPGAs and monitoring the system, collecting the power consumption and performance traces used to train the models. Moreover, this solution has been designed following a microservice-based approach to enable the seamless deployment of hardware-accelerated functions to any platform across the continuum.
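The incremental-modeling idea can be sketched as follows in Python: a power predictor is updated at run time with each new batch of monitored observations instead of being retrained from scratch. The feature layout (three run-time monitor readings) and the synthetic data are illustrative assumptions, not the thesis's actual monitoring infrastructure.

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

model = SGDRegressor(learning_rate="constant", eta0=0.01)
scaler = StandardScaler()

def observe_batch(features, measured_power):
    # update the online scaler and the regressor with the latest run-time traces
    X = scaler.partial_fit(features).transform(features)
    model.partial_fit(X, measured_power)

rng = np.random.default_rng(0)
X0 = rng.uniform(size=(64, 3))                       # e.g. active accelerators, clock, occupancy
y0 = 2.0 * X0[:, 0] + 0.5 * X0[:, 1] + rng.normal(0, 0.01, 64)
observe_batch(X0, y0)                                # initial characterization

X1 = rng.uniform(size=(8, 3))                        # a later, previously unseen workload
observe_batch(X1, 2.0 * X1[:, 0] + 0.5 * X1[:, 1])   # refine the model, no retraining from scratch
print("predicted power:", model.predict(scaler.transform(X1[:2])))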
18:30 CEST PhDF.23 POWER, PERFORMANCE, AND THERMAL TRADE-OFFS IN MANYCORE ARCHITECTURES
Speaker and Author:
Gaurav Narang, Washington State University, US
Abstract
Non-Volatile Memory (NVM) based crossbars suffer from various non-idealities that affect the overall inferencing accuracy. To address that, the matrix-vector-multiplication operations are computed by activating a subset of the full crossbar, referred to as Operation Unit (OU). However, OU configurations (sizes) vary with the neural layers' features such as sparsity, kernel size, and their impact on predictive accuracy. We consider the problem of learning appropriate layer-wise OU configurations in ReRAM crossbars for unseen DNNs at runtime such that the performance is maximized without loss in predictive accuracy. We develop a machine learning (ML) based framework called Odin, which selects the OU sizes for different neural layers as a function of the neural layer features and time-dependent ReRAM conductance drift. Our experimental results demonstrate that the EDP is reduced by up to 8.7× over state-of-the-art homogeneous OU configurations without compromising predictive accuracy.
18:30 CEST PhDF.24 SOLVING COMBINATORIAL OPTIMIZATION PROBLEMS IN CAD WITH RRAM-BASED UNIVERSAL ISING MACHINE
Speaker:
Wenshuo Yue, Peking University, CN
Authors:
Wenshuo Yue and Bonan Yan, Peking University, CN
Abstract
Ising machines are annealing processors that leverage the physical dynamics of Ising graphs to address combinatorial optimization problems (COPs). Nevertheless, these machines are constrained to problems with specific graph topologies due to their inherent fixed spin configuration and connectivity. This thesis work explores hardware-software co-design approaches to develop a novel paradigm of hardware Ising machine, the universal Ising machine (UIM), enabled by resistive random-access memory (RRAM) and compute-in-memory (CIM) technology. It effectively accelerates the solving of COPs in computer-aided design (CAD). (1) This work designs and fabricates a multifunctional RRAM chip that integrates a content-addressable memory, compute-in-memory, and a random number generator in one chip. (2) This work proposes a novel paradigm, a universal Ising machine, that supports arbitrary Ising graph topology with adaptive low-cost hardware. The approach, interaction-centric storage, is suitable for any Ising graph and reduces the memory scaling cost. We experimentally implement the Ising machine on a 40nm RRAM CIM chip. (3) This work proposes a hardware-software co-design technique that, for the first time, maps a practical CAD problem onto the UIM. We use the UIM to solve max-cut and graph coloring problems, with the latter showing a 442–1450× improvement in speed and a 4.1e5–6.0e5× reduction in energy consumption compared to a GPU. When applied to a realistic CAD problem, multiple patterning lithography layout decomposition, the UIM achieves a 390–65,550× speedup compared to the ILP algorithm on a CPU.
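For readers unfamiliar with the mapping step, the following Python sketch casts a small MAX-CUT instance onto an Ising energy and anneals spin flips in software; it is a toy stand-in for the RRAM-based hardware described above, with the graph and cooling schedule chosen purely for illustration.

import math, random

edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0, (0, 2): 1.0}

def ising_energy(spins):
    # MAX-CUT maximizes sum_w (1 - s_i s_j) / 2, i.e. it minimizes H = sum_w s_i s_j
    return sum(w * spins[i] * spins[j] for (i, j), w in edges.items())

spins = {v: random.choice((-1, 1)) for v in range(4)}
T = 2.0
for _ in range(2000):
    v = random.randrange(4)
    before = ising_energy(spins)
    spins[v] *= -1                                   # propose a spin flip
    after = ising_energy(spins)
    if after > before and random.random() >= math.exp((before - after) / T):
        spins[v] *= -1                               # reject uphill moves, Metropolis style
    T *= 0.999                                       # cool down

cut = sum(w for (i, j), w in edges.items() if spins[i] != spins[j])
print("spins:", spins, "cut value:", cut)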
18:30 CEST PhDF.25 ACTIVE ROOT-OF-TRUST ARCHITECTURES FOR LOW-END EMBEDDED SYSTEMS
Presenter:
Youngil Kim, University of California, Irvine, US
Author:
Youngil Kim, University of California, Irvine, US
Abstract
This paper identifies key limitations in current RoT architectures for low-end IoT devices and introduces active RoTs: CASU, EILID, and TRAIN. These architectures are constructed via software/hardware co-design, incorporating minimal hardware modifications to support security features. We implement all RoTs on an open-source MSP430 core, a representative microcontroller for low-end IoT devices, and demonstrate their feasibility with real-world applications. Experimental results indicate that they achieve low runtime and hardware overhead.
18:30 CEST PhDF.26 SYSTEMATIC DESIGN AND EFFICIENT AUTOMATED IMPLEMENTATION OF LOGIC LOCKING
Speaker:
Akashdeep Saha, New York University Abu Dhabi, AE
Authors:
Akashdeep Saha1, Debdeep Mukhopadhyay2 and Rajat Subhra Chakraborty2
1New York University Abu Dhabi, AE; 2IIT Kharagpur, IN
Abstract
The escalating costs of IC fabrication have driven the adoption of fabless operations and a "horizontal" business model, emphasizing outsourcing across the IC supply chain. While this approach reduces production costs and accelerates time-to-market, it introduces vulnerabilities that result in billions of dollars in losses due to threats such as IP piracy [1], infringement, IC overproduction, and hardware Trojan insertions. Logic locking has emerged as a proactive defense mechanism, protecting designs by integrating key-based logic to thwart these potent supply chain threats. My PhD work explores advances in logic locking along three aspects. First, it identifies vulnerabilities in advanced state-of-the-art sequential and combinational locking techniques. We introduce novel attacks like ORACALL [7] and DIP Learning [5], compromising Cellular Automata (CA)-based FSM obfuscation and CAS-Lock, respectively. Secondly, it proposes countermeasures that enhance the non-linearity of CA structures [8] to mitigate attacks without affecting their crucial properties, and it leverages the security of cryptographic SPN-based block ciphers to design robust logic locking. Finally, we present MIDAS, an end-to-end CAD framework automating logic locking techniques across multiple paradigms. By unifying diverse approaches and leveraging graph-based analysis, MIDAS establishes a robust foundation for scalable, secure logic obfuscation.
18:30 CEST PhDF.27 AGING PHENOMENA IN DIGITAL CIRCUITS: CHARACTERIZATION, MITIGATION AND EXPLOITATION.
Speaker:
Andres Santana Andreo, Instituto de Microelectrónica de Sevilla, IMSE, CNM (CSIC, Universidad de Sevilla), ES
Authors:
Andres Santana Andreo1, Rafael Castro Lopez2, Elisenda Roca2 and Francisco Fernandez2
1Instituto de Microelectrónica de Sevilla, IMSE, CNM (CSIC, Universidad de Sevilla), ES; 2Instituto de Microelectronica de Sevilla, IMSE, CNM (CSIC, Universidad de Sevilla), ES
Abstract
The enormous benefits that CMOS technology scaling has brought have come along with an increase in variability. Not only Time-Zero Variability (TZV), which exists right after fabrication, but also Time-Dependent Variability (TDV) effects, like aging, are becoming more relevant and damaging and need to be considered during circuit design. Some examples of TDV phenomena are Bias Temperature Instability (BTI) or Hot Carrier Degradation (HCD). These phenomena show a stochastic nature, which makes them much harder to model. To address this issue, stochastic defect-centric models such as the Probabilistic Defect Occupancy (PDO) model are used. Parameter extraction for these models requires massive device characterization so that statistically significant information is obtained. Once these parameters are obtained, the model can be integrated into a simulation tool, and circuit reliability predictions can be made to prevent the impact of aging on the final design. Accuracy is critical, as overcompensation leads to unnecessary performance loss and undercompensation to early circuit failure. Specifically, in digital circuits, aging generally results in a longer propagation delay for logic gates, ultimately leading to potential timing violations. This thesis tackles the issue of TDV in digital circuits around three pillars: Characterization (by employing a novel chip design to characterize the aging degradation of individual logic gates), Modelling (by accurately modeling the circuit degradation under complex workloads with advanced compression techniques, introducing accurate guardbands into the design flow) and Exploitation (by employing the knowledge of TDV to produce reliable and cheap hardware security primitives).
18:30 CEST PhDF.28 DIGITAL TWINS IN AIRCRAFT: MERGING CYBER-PHYSICAL SYSTEM AND HUMAN DECISION-MAKING
Speaker:
Francesco Biondani, Università di Verona, IT
Authors:
Francesco Biondani and Franco Fummi, Università di Verona, IT
Abstract
The aviation industry is undergoing a profound digital transformation fueled by advancements in Artificial Intelligence (AI), the metaverse, and cybersecurity. At the forefront of this transformation are Digital Twins (DT), which hold immense potential for enhancing operational efficiency and safety. However, implementing Digital Twins on resource-constrained, in-service aircraft presents significant challenges. This research addresses these challenges from two complementary perspectives: Cyber-Physical Systems (CPS) and human-centered design. From the CPS perspective, a power-efficient digital twin framework has been developed and tailored specifically for predictive maintenance. Concurrently, the research leverages the metaverse to collect edge-case data and simulate human behavior in decision-making scenarios, bridging technological innovation with human factors to advance aviation safety and efficiency.
18:30 CEST PhDF.29 FAULTY BEHAVIORS SIMULATION IN INDUSTRIAL CYBER-PHYSICAL SYSTEMS FOR SAFETY ANALYSIS
Speaker and Author:
Francesco Tosoni, Università di Verona, IT
Abstract
Recently, industrial evolution has accelerated owing to the Industry 4.0 phenomenon. The Industrial Cyber-Physical Systems (ICPSs) that compose smart factories are increasingly complex and interconnected with each other and with humans. In such a context, functional safety is crucial for production, economic and legal reasons. Maintaining the correctness of the system functionality is achieved by monitoring the machine status and key parameters during its working phase. Virtual models and behavioral simulations are powerful tools for producing solid ICPSs and the safety measures required in such environments. Despite the complexity of creating these models, simulation is key to the design of not only the main system but also the surrounding production environment. In order to analyze the system's behavior, multi-domain behavioral fault taxonomies have been produced and tested in simulation on different case studies. Fault injection and simulation methodologies have been applied in the Verilog-AMS environment, as well as in Simulink and SystemC. In addition, an exploration of the potential of game engines as simulators of physical systems is ongoing, owing to their high accuracy in graphics rendering. The same fault models have also been useful for developing fault detection mechanisms based on Time-Sensitive Behavioral Contracts (TSBCs). Simulation models of the system under analysis enable the design and refinement of the contracts defined in the monitors. Future developments involve applying the same methodology to mixed-signal systems, thus including the system control part as well.
18:30 CEST PhDF.30 HIGH-PERFORMANCE AND FLEXIBLE HARDWARE ARCHITECTURES FOR FPGA-BASED SMARTNICS
Speaker:
Klajd Zyla, TU Munich, DE
Authors:
Klajd Zyla and Andreas Herkersdorf, TU Munich, DE
Abstract
In-network computing is a recent approach proposed by the research community to address the rise in computing demands associated with the significant growth of network traffic. This paradigm shift is increasing the number of tasks executed by network devices. As a result, processing demands are becoming more diverse, requiring flexible packet-processing architectures. State-of-the-art approaches provide a high degree of flexibility at the expense of performance for complex computations, or they ensure high performance but only for specific use cases. In my PhD thesis work, I proposed and developed high-performance and flexible hardware architectures tailored for FPGA-based SmartNICs, including a novel crossbar switch design and a novel NoC router design. I conducted experiments with synthetic and real-world network traffic to demonstrate their feasibility and advantages compared with state-of-the-art approaches. I focused on the following metrics: throughput, latency, and FPGA resource usage.
18:30 CEST PhDF.31 THE ACCELERATION OF GAUSSIAN BELIEF PROPAGATION USING RECONFIGURABLE HARDWARE
Presenter:
Omar Sharif, Imperial College London, GB
Author:
Omar Sharif, Imperial College London, GB
Abstract
Gaussian Belief Propagation (GBP) is an iterative method of performing probabilistic inference over factor graphs. Factor graphs, which represent relationships between variables and factors as bipartite structures, enable efficient statistical inference through message-passing algorithms. GBP is one such algorithm, which finds extensive application in domains such as simultaneous localization and mapping (SLAM) and image denoising, where approximate solutions to joint probability distributions are sufficient, making it a promising candidate for hardware acceleration in modern robotic systems. Despite its utility, GBP faces significant compute challenges when scaled to large graphs, especially in hardware-constrained environments. Our previous work during the PhD (featured at DATE 2024) presented a framework for designing scalable GBP processors using streaming architectures to process large graphs effectively. Our framework achieved remarkable improvements in performance efficiency (i.e., inference per watt), making it an extremely desirable solution for edge applications. However, scalability limitations remained. To address this, our current work (to be featured at DATE 2025) introduces a novel scheduler based on the information gain of message passes to prioritize node updates and thereby reduce wasted computations. By dynamically prioritizing nodes for update and double-buffering stream inputs, we achieve significant improvements in both processing and convergence rates for equal resources.
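A software sketch of the scheduling idea is given below: Gaussian belief propagation on a small pairwise model, where the node whose outgoing messages changed most is updated first, a residual-style proxy for the information-gain criterion described above. The 3-node precision matrix and the fixed iteration budget are illustrative assumptions.

import numpy as np

A = np.array([[3.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 3.0]])      # symmetric, diagonally dominant precision matrix
h = np.array([1.0, 2.0, 3.0])        # potential vector; the exact means are solve(A, h)
n = len(h)
nbrs = {i: [j for j in range(n) if j != i and A[i, j] != 0.0] for i in range(n)}
P = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}   # message precisions
m = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}   # message potentials

def send_from(i):
    change = 0.0
    for j in nbrs[i]:
        p = A[i, i] + sum(P[(k, i)] for k in nbrs[i] if k != j)
        q = h[i] + sum(m[(k, i)] for k in nbrs[i] if k != j)
        newP, newm = -A[i, j] ** 2 / p, -A[i, j] * q / p
        change += abs(newP - P[(i, j)]) + abs(newm - m[(i, j)])
        P[(i, j)], m[(i, j)] = newP, newm
    return change

residual = {i: float("inf") for i in range(n)}
for _ in range(30):
    i = max(residual, key=residual.get)          # update the most "informative" node first
    residual[i] = 0.0
    delta = send_from(i)
    for j in nbrs[i]:
        residual[j] += delta                     # neighbours now hold stale messages

means = [(h[i] + sum(m[(k, i)] for k in nbrs[i])) /
         (A[i, i] + sum(P[(k, i)] for k in nbrs[i])) for i in range(n)]
print("GBP means:", np.round(means, 4), "exact:", np.round(np.linalg.solve(A, h), 4))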
18:30 CEST PhDF.32 DOMAIN-SPECIFIC BENCHMARKS AND ARCHITECTURES FOR APPLICATIONS USING GRAPH-BASED DATA
Presenter:
Andrew McCrabb, University of Michigan, US
Authors:
Andrew McCrabb and Valeria Bertacco, University of Michigan, US
Abstract
Graph processing is foundational to modern applications like social networks, recommendation systems, and machine learning. In this dissertation, we identify that these applications span three distinct categories: graph-as-data-framework, graph-as-algorithmic-framework, and graph-as-both-frameworks, each presenting unique computational challenges. Graph-as-data-framework applications, such as PageRank, require improved memory bandwidth and data organization. Graph-as-algorithmic-framework applications, like Random Forests, demand more parallelism and bandwidth. Graph-as-both-framework applications, exemplified by Graph Neural Networks (GNNs), require a combination of all of the above. To improve the performance of these applications, this dissertation introduces three custom Processing-in-Memory (PIM) hardware accelerator designs, each tailored to one category of graph application. DREDGE addresses dynamic graph workloads through adaptive vertex partitioning to reduce communication overhead. ACRE accelerates tree-based ensemble learning while enabling explainable models. GLEAM optimizes GNN aggregation functions for enhanced efficiency and scalability. Finally, to support future advancements, this work presents the DyGraph and BeXAI benchmark suites for evaluating dynamic graph processing and explainable AI tasks, respectively. Together, these contributions help solve key challenges toward improving the performance of graph-based processing applications using both current and future memory technologies.
18:30 CEST PhDF.33 TOWARDS PERSONALIZED AI HEALTHCARE BACKED BY EMERGING TECHNOLOGIES
Speaker:
Ruiyang Qin, University of Notre Dame, US
Authors:
Ruiyang Qin and Yiyu Shi, University of Notre Dame, US
Abstract
In modern healthcare, personalized strategies that account for individual genetic, lifestyle, and medical histories are outperforming traditional, one-size-fits-all approaches. This shift is propelled by technological advancements, notably wearable edge devices equipped with conversational large language models (LLMs). However, these LLM-based solutions often fail to fully account for specific genetic and behavioral factors, relying predominantly on broad historical data from the general population, which limits their effectiveness in providing real-time, customized health advice and alerts. This limitation is exacerbated by the limited processing capabilities of current edge devices, impairing their responsiveness and reliability in critical scenarios such as suicide prevention or stroke detection. To address these challenges, I, in collaboration with a team of local psychologists and physicians from Indiana University School of Medicine, have developed a comprehensive cross-layer design from AI algorithms to emerging techniques like FeFET-based CiM architectures, for on-device AI personalization in healthcare applications. My thesis integrates four complementary components: Data Selection, RAG on CiM, Prompt Tuning via NVCiM, and Tiny-Align, each addressing distinct yet interconnected aspects of LLM-based personalization on resource-constrained edge devices. I have published the first three of them as the first author.
18:30 CEST PhDF.34 ENERGY-EFFICIENT MIXED-SIGNAL IN-SENSOR AND IN-MEMORY COMPUTING
Speaker:
Md Abdullah-Al Kaiser, University of Wisconsin - Madison, US
Authors:
Md Abdullah-Al Kaiser1 and Akhilesh Jaiswal2
1University of Wisconsin - Madison, US; 2U Wisconsin Madison, US
Abstract
Modern computing systems face two critical challenges: the cognitive wall and the memory wall. The cognitive wall represents the challenge faced by edge devices in Internet of Things (IoT) and artificial intelligence (AI) applications when trying to process large amounts of sensory data while operating with limited power and efficiency. The memory wall, on the other hand, highlights the growing performance gap between fast processors and slower memory access, leading to inefficiencies and increased energy consumption. These challenges arise from the conventional separation of sensors, memory, and processing units, which requires frequent data transfers between these components. This segmentation not only leads to higher energy consumption and processing delays but also impedes the efficiency of data transfer. To address these bottlenecks, there is a critical need for more integrated solutions that embed computation directly within sensors and memory. Hence, this research introduces two innovative solutions: (1) In-sensor computing through a hybrid CMOS+X architecture for neuromorphic vision sensors (NVS), combining CMOS transistors with magnetic domain-wall magnetic tunnel junctions (MDW-MTJs) for parallel, asynchronous, and energy-efficient computation at the pixel level. This approach reduces backend-processor energy consumption by 45.3%, while maintaining high accuracies of 97.82% on NMNIST, 79.17% on CIFAR10-DVS, and 95.99% on IBM DVS128-Gesture. (2) In-memory computing through a differential cross-coupled photonic SRAM (pSRAM)-augmented photonic tensor core for ultra-fast, low-energy matrix computations. The pSRAM achieves read/write speeds of 20 GHz with a switching energy of just 0.6 pJ, significantly improving matrix multiplication speed and efficiency. By embedding computation directly into sensors and memory, this work effectively addresses both the cognitive and memory walls, leading to significant energy savings and enhanced system performance. These integrated solutions offer a promising path forward for next-generation, energy-constrained, and data-intensive applications, particularly in fields such as IoT and AI.
18:30 CEST PhDF.35 PRINTED NEUROMORPHIC COMPUTING FOR ULTRA-RESOURCE-CONSTRAINED EDGE INTELLIGENCE
Speaker:
Priyanjana Pal, Karlsruhe Institute of Technology, DE
Authors:
Priyanjana Pal and Mehdi Tahoori, Karlsruhe Institute of Technology, DE
Abstract
With the evolution of next-generation electronics, expectations for fast-moving-consumer-goods (FMCG) electronics have grown significantly. In applications like on-skin electronics, such as smart band-aids, comfort and biocompatibility are key concerns, while in other areas, such as smart packaging and smart labels, the demand for ultra-low-cost, disposable electronics has become essential. Traditional silicon-based electronics, although they have evolved significantly in recent years, remain limited by their bulky substrates and complex manufacturing processes, making them unsuitable for these new demands. Printed Electronics (PE) has emerged as a promising alternative, using simple manufacturing techniques that deposit functional inks onto flexible substrates, reducing manufacturing costs and time and enabling features like nontoxicity, flexibility, and biodegradability. However, PE faces challenges due to its larger feature sizes and lower device counts, necessitating analog signal processing to bypass expensive ADC costs. Addressing the inherent challenges of variability, fault tolerance, and energy limitations in printed electronics requires robust design strategies to ensure reliability and performance. The main aim of this dissertation is to design and optimize printed neuromorphic circuits (pNCs) for robust, energy-efficient, and scalable applications in IoT, wearables, and edge computing. By addressing variability, fault tolerance, and manufacturing constraints, it leverages methods like Neural Architecture Search (NAS), energy-efficient computing, and adaptive mechanisms for temporal data processing to develop cost-effective, reliable, and bespoke pNCs for next-generation ultra-low-cost flexible electronics.
18:30 CEST PhDF.36 ENERGY-EFFICIENT ACCELERATORS FOR ML APPLICATIONS WITH IMPROVED RRAM DEVICE LIFETIME
Speaker:
Neethu K, School of Engineering, CUSAT, IN
Authors:
Neethu K1, Rekha James1 and Sumit Mandal2
1School of Engineering, Cochin University of Science and Technology, IN; 2Indian Institute of Science, IN
Abstract
Modern 2.5D systems built on in-memory computing (IMC)-based devices are well suited for DNN operations. However, they do not typically address the issue of large storage needs and high on-package as well as on-chip communication volume required during DNN training tasks. Different chiplets in the 2.5D system communicate with each other and with the storage device(s) using an interconnection mechanism called network-on-package (NoP). Studies show that the majority of the total communication energy is consumed by DRAM NoP communication. Hence, there is a need to construct an energy-efficient 2.5D system with IMC to perform DNN training. Moreover, the state-of-the-art IMC devices used for DNN accelerator design are RRAM-based. Owing to the low endurance of RRAM devices, only a limited number of weight updates can be performed while training different networks. To this end, we also propose an adaptive layer selection approach for DNN training to improve the lifetime of a 2.5D system with RRAM-based IMC devices.
18:30 CEST PhDF.37 IMPLEMENTATION AND EVALUATION OF DIFFERENT STRATEGIES OF COUNTERMEASURES TO PROTECT A RISC-V CORE AGAINST BOTH SOFTWARE AND PHYSICAL ATTACKS
Speaker:
William Pensec, Université Bretagne Sud, Lab-STICC, FR
Authors:
William Pensec1, Vianney Lapotre2 and Guy Gogniat3
1Université Bretagne Sud, UMR CNRS 6285, Lab-STICC, FR; 2University Bretagne Sud, UMR CNRS 6285, Lab-STICC, FR; 3Université Bretagne Sud, FR
Abstract
Nowadays, IoT devices face many threats. As these devices manipulate sensitive data, they need to be protected against both software and physical attacks. A solution against software attacks is to use a Dynamic Information Flow Tracking (DIFT) mechanism. DIFT techniques can detect various software attacks, such as memory overflows, SQL injections, etc., by attaching and propagating tags to information containers at runtime. A security policy determines the DIFT mechanism's behaviour. If a malicious behaviour is detected, an alert can be raised. Several implementations have been studied in the literature: hardware, software, and hybrid. Information containers differ depending on which type of DIFT is used; they range from files to registers. Hardware DIFT solutions can be grouped into two main categories: off-core and in-core. Off-core DIFT relies on a dedicated coprocessor to perform tag-related operations. This approach does not require internal processor modification and reduces the computation load on the main processor. In-core DIFT leads to significant, invasive modification of the processor. Tag-related operations are spread over the pipeline stages and are computed in parallel with the data computations. Compared to the off-core approach, it does not require specific communication and synchronisation management. In this work, we consider the D-RI5CY processor, which implements an in-core hardware DIFT, and we analyse its behaviour against Fault Injection Attacks (FIAs). FIAs can be performed by disturbing the power supply or the clock, or by using EM pulses or laser shots. Numerous studies have demonstrated the vulnerabilities of critical systems against FIAs; for example, glitch injections on the power supply have been used to manipulate the program counter (PC). These physical attacks effectively bypass protection mechanisms, allowing attackers to hijack the targeted system. Our objective is to develop effective countermeasures against FIAs to efficiently protect the D-RI5CY DIFT mechanism in order to obtain a system that is robust against both software and physical attacks.
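The tag-propagation principle behind DIFT can be sketched in a few lines of Python: one-bit tags attached to registers propagate through instructions, and the policy raises an alert when tainted data reaches a sensitive sink such as an indirect jump target. The toy instruction set below is an illustrative assumption, not the D-RI5CY implementation.

regs = {f"x{i}": 0 for i in range(4)}
tags = {r: 0 for r in regs}                   # 1 = value derived from untrusted input

def execute(instr):
    op, *args = instr
    if op == "load_input":                    # untrusted data enters the system
        rd, value = args
        regs[rd], tags[rd] = value, 1
    elif op == "add":
        rd, rs1, rs2 = args
        regs[rd] = regs[rs1] + regs[rs2]
        tags[rd] = tags[rs1] | tags[rs2]      # propagation policy: OR of source tags
    elif op == "jalr":                        # indirect jump target: a sensitive sink
        (rs,) = args
        if tags[rs]:
            raise RuntimeError(f"DIFT alert: jump target in {rs} is tainted")

program = [
    ("load_input", "x1", 0x40),               # attacker-controlled value
    ("add", "x2", "x1", "x0"),                # taint propagates from x1 to x2
    ("jalr", "x2"),                           # the policy raises an alert here
]
try:
    for instr in program:
        execute(instr)
except RuntimeError as alert:
    print(alert)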
18:30 CEST PhDF.38 SUPPORTING END USERS IN IMPLEMENTING QUANTUM COMPUTING APPLICATIONS
Speaker:
Nils Quetschlich, TU Munich, DE
Authors:
Nils Quetschlich and Robert Wille, TU Munich, DE
Abstract
Quantum computing has made tremendous improvements in both software and hardware that have sparked interest in academia and industry to realize quantum computing applications. To this end, several steps are necessary: choosing a suitable quantum algorithm, encoding it into a quantum circuit, selecting a suitable device, compiling the circuit accordingly, executing it, and finally decoding the result. These steps are rather tedious and error-prone and thus create a high entry barrier for end users with limited quantum computing expertise who need solutions to domain-specific problems. This situation is worsened even further, since bad choices in the described steps can lead to pure noise and, in the worst case, no usable solution.
18:30 CEST PhDF.39 FAULT-TOLERANT CNN ACCELERATOR WITH RECONFIGURABLE CAPABILITIES
Speaker:
Rizwan Tariq Syed, IHP GmbH - Leibniz Institute for High Performance Microelectronics, DE
Authors:
Rizwan Tariq Syed and Milos Krstic, Leibniz-Institut für innovative Mikroelektronik, DE
Abstract
Mapping AI models onto hardware faces significant challenges due to high computation and energy requirements. Continuously varying AI requirements and workloads add to the existing challenge and cause hardware resource utilization to quickly reach its limits. These challenges grow further for safety-critical applications, which require high reliability standards. Thus, there is a need for efficient ways of implementing AI models on hardware that, along with high reliability, can reconfigure itself to fulfill varying application requirements. This research work focuses on the CNN model and, with the aim of addressing the above-mentioned challenges, this thesis presents: (1) a shared-layers methodology to efficiently map CNN models on hardware, (2) a fault-tolerant CNN accelerator with reconfigurable capabilities based on the shared-layers methodology, and (3) the integration of a multi-purpose on-chip sensor into the fault-tolerant reconfigurable CNN accelerator. The results obtained in this research work aim to establish a foundation for the development of fully reconfigurable, resilient AI processing systems, thereby addressing the reliability, performance, and energy consumption challenges faced by the computational hardware.
18:30 CEST PhDF.40 CONQUERING TIMING UNPREDICTABILITY IN HIGH-LEVEL SYNTHESIS
Speaker:
Carmine Rizzi, ETH Zurich, CH
Authors:
Carmine Rizzi and Lana Josipovic, ETH Zurich, CH
Abstract
Designing hardware is a complex and time-consuming task that requires specialized expertise. High-Level Synthesis (HLS) tools have revolutionized this process by streamlining digital hardware design and making it more accessible. These tools start from high-level programming languages such as C/C++ and produce Register-Transfer Level (RTL) code to design circuits (e.g., FPGA or ASIC). However, despite their potential, there remains a significant gap in the quality of circuits designed by experienced hardware engineers compared to those generated by HLS tools. This is mainly due to the inability of HLS to account for the effect of lower-level hardware implementation steps. One consequence of this qualitative disparity is the unpredictability of the operating frequency in circuits generated by HLS tools. The main goal of this thesis is to reduce this gap and the discrepancies between the HLS tool timing model and the final circuit's frequency. This represents a fundamental step in producing circuits with HLS that can achieve high and reliable operating frequency.
18:30 CEST PhDF.41 DEEP LEARNING MODELS OPTIMIZATIONS FOR REAL-TIME INTELLIGENT VIDEO ANALYTICS
Speaker:
Michele Boldo, Università di Verona, IT
Authors:
Michele Boldo and Nicola Bombieri, Università di Verona, IT
Abstract
Real-time video analytics is becoming increasingly important in several domains, including healthcare and Industry 5.0. Edge-based processing is emerging as a solution to reduce latency, safeguard privacy, and effectively manage bandwidth constraints. Although Deep Learning (DL) models are very effective, their substantial computational requirements pose significant challenges for implementation on low-power edge devices. My thesis introduces two main methodologies to face these challenges. The first one is based on Collaborative Deep Inference. This approach mitigates accuracy degradation by partitioning the DL model across multiple devices. The model is dynamically split between the edge device and a server, with the division point selected based on latency constraints and the computational and transmission conditions. Data quantization and compression techniques are employed to minimize the impact on accuracy while optimizing performance. The second one is based on Online Domain Adaptation. This methodology focuses on adapting pre-trained DL models to specific deployment scenarios, particularly when real-world data deviate from the training data. Knowledge Distillation is employed to obtain labels at runtime, where a larger, well-trained "teacher" model transfers its knowledge to a smaller, lightweight "student" model. To determine when the "student" model requires retraining, an algorithm based on Singular Value Decomposition (SVD) is used to monitor prediction quality over time without relying on external labels. The results demonstrate that both methodologies achieve high accuracy while significantly reducing energy consumption and enhancing frame rates.
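One way to picture a label-free retraining trigger of this kind is the following Python sketch: SVD builds a reference subspace from trusted feature batches, and new batches whose energy falls largely outside that subspace are flagged for retraining. The threshold, dimensions, and synthetic data are illustrative assumptions, not the thesis's exact monitoring algorithm.

import numpy as np

def principal_subspace(reference, k=3):
    # right singular vectors of the reference window span its dominant subspace
    _, _, vt = np.linalg.svd(reference, full_matrices=False)
    return vt[:k].T                                   # shape (d, k)

def subspace_energy(batch, basis):
    projected = batch @ basis                         # coordinates inside the subspace
    return np.linalg.norm(projected) ** 2 / max(np.linalg.norm(batch) ** 2, 1e-12)

rng = np.random.default_rng(1)
scale = np.diag([5, 4, 3, 1, .1, .1, .1, .1])
reference = rng.normal(size=(200, 8)) @ scale         # features from a trusted period
basis = principal_subspace(reference)

in_domain = rng.normal(size=(32, 8)) @ scale
shifted = rng.normal(size=(32, 8))                    # drifted distribution
for name, batch in [("in-domain", in_domain), ("shifted", shifted)]:
    energy = subspace_energy(batch, basis)
    print(name, round(energy, 3), "-> retrain student" if energy < 0.8 else "-> ok")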
18:30 CEST PhDF.42 COLLECTIVE METHODOLOGIES FOR EFFICIENT HIGH-LEVEL SYNTHESIS
Speaker:
Aggelos Ferikoglou, National TU Athens, GR
Authors:
Aggelos Ferikoglou1, Sotirios Xydis1 and Dimitrios Soudris2
1National TU Athens, GR; 2National Technical University of Athens, GR
Abstract
This PhD thesis is dedicated to developing methodologies that empower users to effectively utilize High-Level Synthesis (HLS) for Field Programmable Gate Arrays (FPGAs). The primary aim is to simplify the complex and time-intensive process of understanding hardware concepts, making HLS accessible to those without prior expertise in the field. The research emphasizes democratizing HLS by offering good starting points for design optimization and ready-to-use tools that enable designers to produce high-quality results.
18:30 CEST PhDF.43 OPTIMIZATION OF A REMOTE MONITORING PLATFORM FOR EDGE DEVICES
Presenter:
Mirco De Marchi, Università di Verona, IT
Authors:
Mirco De Marchi and Nicola Bombieri, Università di Verona, IT
Abstract
Healthcare technologies have witnessed significant advancements, especially with the need for remote monitoring platforms for safety, training and analysis of human behavior. This study presents the implementation and optimization of a real-time platform for remote motion analysis using Inertial Measurement Unit (IMU) and camera sensors. We claim that more accurate results can be achieved by mixing Human Pose Estimation (HPE) techniques with information collected by wearables. Specifically, we introduce a matching model that fuses HPE and IMU data to compensate for the inaccuracies of low-cost sensors and inaccurate models. Despite this, the presence of multiple deployed models on a resource-constrained device leads to performance degradation. Model compression techniques prove effective at reducing the models' computational load while maintaining good accuracy. We design a novel pruning framework for convolutional neural network (CNN) models tailored for edge devices that ensures optimized inference across multiple performance metrics, including accuracy, latency, and energy consumption. The results indicate its effectiveness in balancing model complexity and performance of motion analysis applications in edge devices.
18:30 CEST PhDF.44 A DESIGN SPACE EXPLORATION FRAMEWORK FOR DNN COMPRESSION USING LOW RANK FACTORIZATION
Speaker:
Milad Kokhazadeh, School of Informatics, Aristotle University of Thessaloniki, GR
Authors:
Milad Kokhazadeh1, Georgios Keramidas2 and Vasilios Kelefouras3
1PhD Candidate, Aristotle University of Thessaloniki, GR; 2Aristotle University of Thessaloniki/Think Silicon S.A., GR; 3University of Plymouth, GB
Abstract
Deep neural networks (DNNs) deliver state-of-the-art performance across various applications but are highly computationally demanding, restricting their deployment on resource-limited edge devices. Low-rank factorization (LRF) is a promising technique to reduce complexity and memory footprint of DNNs while maintaining performance. However, challenges remain in optimizing rank selection, balancing memory-accuracy trade-offs, and integrating LRF into training. To address these challenges, we propose two methodologies: a design space exploration (DSE) framework for optimizing LRF configurations and a feature-map similarity-based strategy for compressing convolutional layers. Our approach automates rank selection and dynamically adjusts compression ratios, achieving over 90% parameter reduction while preserving accuracy, enabling efficient DNN deployment on resource-limited platforms.
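The core low-rank factorization step, and the memory/accuracy trade-off that a design space exploration of this kind sweeps over, can be sketched as follows in Python; the layer size and the rank grid are illustrative assumptions, not the thesis's actual configurations.

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 256))                       # weights of one fully connected layer
U, S, Vt = np.linalg.svd(W, full_matrices=False)

for rank in (4, 16, 64, 128):
    A = U[:, :rank] * S[:rank]                        # first factor, shape (512, rank)
    B = Vt[:rank]                                     # second factor, shape (rank, 256)
    params = A.size + B.size
    rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
    ratio = params / W.size
    print(f"rank={rank:3d}  params={ratio:5.1%} of original  rel. error={rel_err:.3f}")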
18:30 CEST PhDF.45 POWER-EFFICIENT APPROXIMATE 4:2 COMPRESSORS FOR IMAGE MULTIPLICATION AND NEURAL NETWORKS
Speaker:
Vinicius Zanandrea, Federal University of Santa Catarina, BR
Authors:
Vinicius Zanandrea and Cristina Meinhardt, Federal University of Santa Catarina, BR
Abstract
This work proposes two approximate 4:2 compressors, MAX4:2CV1 and MAX4:2CV2, targeting power efficiency and area optimization. We demonstrate the advantages of employing the proposed 4:2 compressors for partial product reduction in Dadda tree multipliers. Also, we compare the performance of our proposed circuits with seven approximate 4:2 compressors from the literature and with an exact compressor. The MAX4:2CV2-based multiplier achieved Peak Signal-to-Noise Ratio (PSNR) values of 31 dB on average for pixel-wise image multiplication, indicating acceptable quality results for error-tolerant applications. This proposal reduces delay by up to 50.4%, power consumption by up to 59.2%, and Power-Delay Product (PDP) by up to 79.7% compared to an exact multiplier. Experiments with two datasets demonstrated that using MAX4:2CV1 in approximate multipliers for neural networks can result in comparable accuracy to exact multipliers while reducing power consumption by up to 56%. The set of information provided in this PhD work supports designers in choosing the best approximate multiplier according to the design requirements.
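To make the evaluation procedure concrete, the Python sketch below enumerates all 16 input patterns of a 4-input 4:2 compressor, compares the two-bit output value against the exact bit count, and reports the usual error metrics; the approximate logic shown is a simple illustrative choice, not the MAX4:2CV1/CV2 designs themselves.

from itertools import product

def exact_count(x1, x2, x3, x4):
    return x1 + x2 + x3 + x4                  # what sum + 2*(carry + cout) would encode exactly

def approx_compressor(x1, x2, x3, x4):
    carry = (x1 & x2) | (x3 & x4)
    s = (x1 ^ x2) | (x3 ^ x4)
    return s, carry                           # approximate value = s + 2*carry

errors = []
for bits in product((0, 1), repeat=4):
    s, c = approx_compressor(*bits)
    errors.append((s + 2 * c) - exact_count(*bits))

error_rate = sum(e != 0 for e in errors) / len(errors)
mean_error_distance = sum(abs(e) for e in errors) / len(errors)
print(f"error rate = {error_rate:.3f}, mean error distance = {mean_error_distance:.3f}")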
18:30 CEST PhDF.46 HARDWARE CNN ACCELERATOR DESIGNS CONFIGURED WITH STATISTICALLY ERROR VARIANT APPROXIMATE MULTIPLIERS
Speaker:
Bindu G Gowda, International Institute of Information Technology Bangalore, IN
Authors:
Bindu G Gowda1 and Madhav Rao2
1International Institute of Information Technology, Bangalore, IN; 2International Institute of Information Technology-Bangalore, IN
Abstract
Convolutional Neural Networks (CNNs) are renowned for their exceptional feature extraction capabilities, making them a cornerstone in various applications. However, implementing CNNs in hardware poses challenges due to extensive computational requirements, especially in multipliers, which are the most power-intensive and latency-prone units. Approximate computing techniques have gained attention for their potential to reduce power consumption, enhance performance, and improve space efficiency. Despite the widespread intention to apply approximate computing to AI workloads, the hardware benefits have in the past not been fully realized without compromising network accuracy. This research work commenced with the design of novel error-balanced approximate multipliers (AMs), introducing approximation at the partial product reduction stage of the multiplication process using approximate compressors (ACs). Two categories of AC designs were proposed, considering the statistical mean error and the direction of the error distribution, and 8 distinct configurations of AMs were constructed by strategically positioning these ACs over the generated partial products and in the successive reduction stages, to achieve error-balanced designs with favorable error metrics. This research work then introduces approximate multipliers along the convolutional layers of the CNN and thus presents a unique framework for designing hardware-efficient and error-resilient on-chip design solutions for accelerating Machine Learning workloads. Adopting AMs relaxes hardware demands but suffers from a drop in network accuracy, and hence choosing AMs becomes pivotal. Leveraging a precise combination of multipliers along the convolutional layers, instead of uniform multipliers throughout the network, was found to enhance network performance. Considering that the exhaustive approach is a highly laborious task, it was important to explore the use of optimization algorithms to arrive at the optimal solution. Single- and multi-objective algorithms were further exploited to identify the Pareto-optimal solutions comprising AM sequences that balance hardware parameters and CNN accuracy loss.
18:30 CEST PhDF.47 SECURING THE TEST INFRASTRUCTURE OF SOCS
Speaker:
Anjum Riaz, IIT Jammu, IN
Authors:
Anjum Riaz and Satyadev Ahlawat, IIT Jammu, IN
Abstract
The IEEE Standard 1687 (IJTAG) has become a widely adopted framework for efficient access to on-chip instruments, enabling functionalities like testing, diagnostics, post-silicon validation, and system health monitoring throughout the lifecycle of System-on-Chip (SoC) devices. However, its lack of integrated security mechanisms exposes the scan network to potential side-channel and malicious instrument attacks, posing risks such as data sniffing, alteration, IP theft, and reverse engineering. Existing solutions leveraging user authorization, cryptographic methods, and secure protocols partially address these vulnerabilities but often fail to scale efficiently or preserve IJTAG's functional flexibility. This thesis proposes a secure extension to the IJTAG standard by introducing the Inherently Secure SIB (ISSIB), designed to safeguard the IJTAG network from unauthorized access while maintaining its dynamic reconfigurability. The ISSIB achieves robust security with minimal area overhead (1.11% compared to standard SIB). Additionally, the scope of ISSIB is extended to secure high-speed Streaming Scan Networks (SSN) with an area overhead of only 1.91%, significantly lower than alternative solutions. Further enhancements include a topology to mitigate data sniffing and alteration threats by ensuring direct data paths between test instruments, avoiding interference by malicious components. Lastly, this study explores leveraging functional ports (e.g., UART) as secure alternatives to Test Access Ports (TAP), reducing access time and data overhead by up to 45.51% and 69.66%, respectively, while maintaining encryption-based security. These advancements address critical IJTAG vulnerabilities, enabling secure and efficient operation across resource-constrained and high-performance SoC environments.
18:30 CEST PhDF.48 OPEN-SOURCE DESIGN OF A LOW-POWER SNN HARDWARE ACCELERATOR FOR EDGE AI
Presenter:
Luca Martis, Università degli studi di Cagliari, IT
Authors:
Luca Martis1 and Paolo Meloni2
1Università degli studi di Cagliari, IT; 2Università degli Studi di Cagliari, IT
Abstract
Edge computing brings data processing closer to the source that generates the data, offering benefits such as reduced latency, lower bandwidth usage, and increased system reliability. Implementing artificial intelligence (AI) algorithms at the edge is essential for creating intelligent sensors but poses challenges due to the energy and computational limitations of edge devices. To address these challenges, Spiking Neural Networks (SNNs) have gained attention as a promising AI solution for edge applications due to their energy efficiency. Neuromorphic processors, optimized for the sparse, event-driven nature of SNNs, offer significant energy savings and faster response times. However, the adoption of neuromorphic processors remains constrained by their high costs and the challenges of integrating them with existing edge devices. The objective of this thesis is to develop a hardware accelerator for SNNs tailored to on-edge applications, prioritizing low power consumption and real-time operation. The accelerator's layout will be implemented using open-source Electronic Design Automation (EDA) tools to minimize costs and overcome traditional barriers to hardware innovation, enabling accessible and efficient solutions for edge AI systems.
18:30 CEST PhDF.49 A PROPOSED EDA FLOW FOR ITERATIVE HARDWARE/RESILIENCE CO-DESIGN
Speaker:
Peer Adelt, University of Applied Sciences Hamm/Lippstadt, Germany, DE
Authors:
Peer Adelt1 and Achim Rettberg2
1Hamm-Lippstadt University of Applied Sciences, DE; 2Carl von Ossietzky University Oldenburg, DE
Abstract
The classical HW/SW co-design flow begins with requirement elaboration, where system objectives and constraints are analysed to define functional and non-functional requirements. A high-level system specification is then created, outlining desired functionality and performance targets. Next, the application is partitioned into hardware and software components, guided by factors such as performance, cost, energy efficiency, and development complexity to leverage the strengths of each domain. Building on this flow, the proposed extension focuses on fault detection and resilience. It introduces fault modelling, resilience-oriented partitioning, and robust testing strategies like fault injection and resilience-aware co-simulation. The proposed methodology, demonstrated with several applications for the 32-bit Freedom-E RISC-V platform, is available as open-source software on GitHub under https://github.com/hshl-hmit/fear-v.
18:30 CEST PhDF.50 DESIGN AND APPLICATIONS OF SIMULATED BIFURCATION ISING MACHINES
Presenter:
Tingting Zhang, McGill University, CA
Authors:
Tingting Zhang1 and Jie Han2
1McGill University, CA; 2University of Alberta, CA
Abstract
Ising machines have received growing interest as efficient and hardware-friendly solvers for combinatorial optimization (CO) problems. They search for the absolute or approximate ground states of the Ising model. A simulated bifurcation (SB) Ising machine searches for the solution by solving pairs of differential equations related to the oscillator positions and momenta. It benefits from massive parallelism but suffers from relatively high hardware costs. To enhance efficiency while maintaining high-quality solutions for CO problems, this project attempts to use quantization schemes, stochastic computing-based integrators, and approximate multipliers in SB machines. As example applications, the traveling salesman problem (TSP) and the multi-input multi-output (MIMO) detection problem are explored. Quantized SB machines (QSBMs) use innovative quantization methods to replace costly multiplication operations with simpler ones. Ternary algorithms dynamically simplify calculations, and advanced multi-value approaches improve numerical precision. Implemented on an FPGA, a QSBM with 2048 spins reduces hardware usage by over 50% and delivers 98.5% of the best-known solution in just 0.73 ms. Dynamic stochastic computing offers efficient accumulation operations. Stochastic SB machines (SSBMs) use signed stochastic integrators (SSIs) for numerical integration, achieving significant area reductions. Two SB cell types improve efficiency: one focuses on area savings, and the other on reducing delays. The SSBM demonstrates a significant area reduction of at least 10.62% compared to the latest designs. Floating-point (FP) representations enable accurate CO solutions but demand more hardware. This work proposes hardware-efficient logarithmic FP multipliers, achieving quality solutions with reduced costs. In routing and scheduling, the TSP is mapped to an Ising problem, using dynamic time steps and redundant spins to enhance solution quality and runtime. For MIMO systems, the SB-based detector applies regularization and dropout strategies, achieving lower error rates than traditional methods.
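The ballistic simulated-bifurcation dynamics can be sketched as follows in Python: pairs of position and momentum variables evolve under a ramped bifurcation term plus the Ising coupling force, with inelastic walls at |x| = 1. The graph, time step, and constants are illustrative assumptions, not the QSBM/SSBM hardware parameters.

import numpy as np

w = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)     # MAX-CUT edge weights
J = -w                                        # couplings of the equivalent Ising model
n = len(J)

rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(n)             # oscillator positions
y = np.zeros(n)                               # oscillator momenta
a0, c0, dt, steps = 1.0, 0.5, 0.05, 400

for t in range(steps):
    a = a0 * t / steps                        # bifurcation parameter ramps up over time
    y += (-(a0 - a) * x + c0 * (J @ x)) * dt  # momentum update with the coupling force
    x += a0 * y * dt                          # position update
    wall = np.abs(x) > 1.0                    # inelastic walls keep |x| <= 1
    x[wall] = np.sign(x[wall])
    y[wall] = 0.0

spins = np.sign(x)
cut = sum(w[i, j] for i in range(n) for j in range(i + 1, n) if spins[i] != spins[j])
print("spins:", spins.astype(int), "cut value:", cut)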
18:30 CEST PhDF.51 COMPUTATION-IN-MEMORY BASED EDGE-AI FOR HEALTHCARE: A CROSS-LAYER APPROACH
Speaker:
Sumit Shaligram Diware, Computer Engineering Lab, Delft University of Technology, The Netherlands, NL
Authors:
Sumit Diware and Rajendra Bishnoi, TU Delft, NL
Abstract
Edge computing for AI (edge-AI) combines data sources with local AI processing hardware, to provide low response latency, alleviate network costs, enhance data privacy/security, and improve service reliability. Computation-in-memory (CIM) presents a promising alternative to conventional hardware for designing energy-efficient and compact edge-AI hardware. It achieves this through in-situ data processing using emerging memory technologies called memristors. CIM-based edge-AI hardware holds significant potential for AI-based healthcare, where it can greatly enhance human well-being through fast, reliable, and secure processing of medical data. Designing CIM edge-AI hardware for healthcare is a two-phase process spanning six abstraction layers. The first phase involves creating a customized neural network model for the healthcare task and covers the first three abstraction layers (application, algorithm, and optimization). The challenge here is to achieve strong and effective algorithmic performance, while tailoring the model to maximize CIM hardware benefits. The second phase focuses on translating model computations into CIM hardware operations and spans the remaining three abstraction layers (mapping, micro-architecture and circuits, device). Mitigating memristor non-idealities, which introduce computational errors in CIM operations, becomes the primary challenge in this phase. Moreover, it is crucial to integrate the model and mitigation techniques into a holistic solution as a chip prototype. This thesis addresses these challenges using a cross-layer approach. We first create effective and energy-efficient models for cardiac arrhythmia classification and diabetic retinopathy screening, through contributions across the first three abstraction layers. To translate the models onto CIM hardware without accuracy loss, we identify the key memristor non-idealities and devise mitigation strategies against them by contributing across the remaining three abstraction layers. Lastly, we integrate our arrhythmia classification model and non-ideality mitigation strategies into a chip prototype. Thus, our work covers the full abstraction layer stack and paves the way for enhanced AI-based healthcare.
18:30 CEST PhDF.52 OPTIMIZING LEARNING THROUGH CO-DESIGN IN NEUROMORPHIC COMPUTING
Speaker:
Lakshmi Varshika Mirtinti, Drexel University, US
Authors:
Lakshmi Varshika M and Anup Das, Drexel University, US
Abstract
Deep convolutional neural networks (CNNs) have traditionally relied on GPUs for training and inference, but these platforms face inherent limitations, including memory bandwidth constraints, data access bottlenecks, and high energy consumption, making them unsuitable for edge and real-time applications. Neuromorphic systems, inspired by spiking neural networks (SNNs), offer a transformative alternative by mimicking biological neural systems to achieve superior computational efficiency and significantly lower power consumption. A pivotal innovation in neuromorphic computing is on-chip learning, enabling continuous adaptation to streaming data in real time, akin to human-like dynamic learning. This adaptability allows refinement of decision-making in response to evolving data, overcoming the static and task-specific nature of traditional AI systems. Unsupervised learning plays a critical role in this paradigm, enabling pattern and feature extraction from unstructured, unlabeled data prevalent in real-world applications. Instead of treating hardware and software as independent, separate stages in the development process, co-design aligns them from the start, optimizing the interaction between the two. This approach is particularly important for complex, performance-critical systems. This thesis proposes a design methodology for efficient on-chip training of unsupervised applications on hardware. It introduces (1) an Online Learning Unit (OLU) to address hardware challenges for selective unsupervised learning on neuromorphic platforms and (2) a co-design framework that maps spiking CNN applications to diverse core architectures [7]. The following sections outline these contributions and their implications for advancing neuromorphic computing in practical AI solutions.
18:30 CEST PhDF.53 FAULT-TOLERANT TECHNIQUES FOR EMERGING NON-VOLATILE MEMORIES AND NEUROMORPHIC COMPUTING SYSTEMS
Speaker and Author:
Surendra Hemaram, Karlsruhe Institute of Technology, DE
Abstract
The need for high performance and low power consumption in modern computing systems has led to aggressive technology scaling, increasingly limiting the potential of CMOS memory technologies. Emerging non-volatile memories (NVMs) have revolutionized data storage, making them viable alternatives to CMOS memories. In the context of a standalone memory, among other NVMs, spin-transfer torque magnetic random access memory (STT-MRAM) is the most promising candidate, as shown by several industrial demonstrations. However, it has some reliability issues, including soft and hard errors. Addressing these failure mechanisms in STT-MRAM is essential to improve reliability and manufacturing yield. Building on the advancements of emerging NVMs, neuromorphic computing systems have also emerged as a promising approach for neural network (NN) computations, which demand massive storage and matrix operations. In particular, the computation-in-memory (CiM) paradigm, based on a crossbar of resistive NVMs, seamlessly integrates storage and computation, addressing the memory wall issue in conventional architectures. Additionally, unlike CiM, digital NN accelerators, which use memory for storage, have also benefited from advancements in memory technologies by using them as on/off-chip NN weight storage. However, ensuring the reliability of neuromorphic systems is challenging, as resistive NVMs are prone to faults like manufacturing defects, non-idealities, and random telegraph noise, resulting in soft and hard errors and degrading the NN accuracy. Therefore, fault-tolerant mechanisms are crucial for reliable NN operation. This thesis explores hardware-efficient fault-tolerant techniques, which employ error-correcting codes in conjunction with architectural modifications to improve the reliability of NVMs and neuromorphic computing systems. In the context of a standalone memory, we focus on STT-MRAM. However, the proposed solutions can also be applied to other NVMs. In the case of neuromorphic computing applications, we introduce fault-tolerant techniques for resistive NVM-based crossbars used in CiM architectures and address the vulnerabilities in the weight memories of digital NN hardware accelerators.
18:30 CEST PhDF.54 IMPROVING THE EFFICIENCY AND SECURITY OF FULLY HOMOMORPHIC MACHINE LEARNING AS A SERVICE
Presenter:
Lars Folkerts, University of Delaware, US
Author:
Lars Folkerts, University of Delaware, US
Abstract
Machine learning services have become increasingly prevalent in users' daily lives, with both individuals and businesses integrating these technologies into a wide range of applications. From personalized recommendations on streaming platforms to advanced analytics in business operations, the benefits of machine learning (ML) are vast and undeniable. However, despite the widespread adoption of these services, privacy concerns have remained a significant barrier to further integration. Consider the Machine Learning as a Service (MLaaS) paradigm with an honest-but-curious cloud server. Here, users send their data to the cloud for processing, and the cloud sends the computation result back to the user. This enables them to offload computational costs and leverage the cloud service provider's proprietary statistical models, such as neural networks. However, a curious provider may access and exploit stored data, which could then be sold to advertisers or used to improve the model. Fully homomorphic encryption (FHE) offers a solution by enabling computation on encrypted data without revealing it. FHE allows users to send encrypted data to cloud providers for processing, who can execute algorithms on the ciphertext without revealing the underlying plaintext. After computation, the encrypted result is sent back to the user to decrypt. This approach allows cloud servers to maintain control over their model IP while protecting user data privacy. My thesis provides the foundational groundwork for developing the next generation of FHE-based privacy-preserving machine learning. The efficiency improvements include novel binary neural network speedup techniques (REDsec), generative AI algorithms (Tyche) and an FHE-based MLaaS protocol that supports authenticated data storage (Proteus). My research also evaluates the security of several encrypted ML architectures against side-channels, including user data privacy in multi-exit neural networks (FHE-MENNs) and single FHE-layer transformers (Testing Split Model LLMs). These works outline a promising future for secure and feasible private MLaaS computation.
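The round trip described above can be summarised in a few lines of Python. The ToyFHE class below is an insecure placeholder that only illustrates the data flow (who holds the keys, what crosses the network); its method names are invented for this sketch and do not correspond to any real FHE library API.

    class ToyFHE:
        """Insecure stand-in used only to show the MLaaS data flow; real FHE replaces these."""
        def keygen(self):
            return "pk", "sk"
        def encrypt(self, pk, x):
            return ("ct", x)                    # placeholder "ciphertext"
        def evaluate(self, model, ct):
            return ("ct", model(ct[1]))         # cloud computes on the ciphertext
        def decrypt(self, sk, ct):
            return ct[1]

    def mlaas_round_trip(x, model, fhe=None):
        fhe = fhe or ToyFHE()
        pk, sk = fhe.keygen()              # client generates keys and keeps sk
        ct_x = fhe.encrypt(pk, x)          # only ciphertext is sent to the cloud
        ct_y = fhe.evaluate(model, ct_x)   # server evaluates its proprietary model
        return fhe.decrypt(sk, ct_y)       # only the client can read the result

    print(mlaas_round_trip(3.0, lambda v: 2 * v + 1))   # -> 7.0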
18:30 CEST PhDF.55 OPTIMIZING CONVOLUTIONAL WEIGHT MAPPING FOR ENERGY-EFFICIENT IN-MEMORY CNN INFERENCE
Speaker:
Johnny Rhe, Sungkyunkwan University, KR
Authors:
Johnny Rhe and Jong Hwan Ko, Sungkyunkwan University, KR
Abstract
In-memory computing (IMC) architectures have emerged as one of the most viable options for faster and more power-efficient convolutional neural network (CNN) inference. The key challenge in IMC architectures is optimizing the mapping of convolutional weights onto memory arrays to enhance energy efficiency and reduce inference latency. Recent research has introduced mapping methods to facilitate convolution operations within IMC arrays. However, existing approaches often fail to optimize memory usage, as they do not account for variations in array and layer sizes. This limitation results in underutilized resources, increased energy consumption, and a large drop in inference accuracy. This thesis addresses these limitations by proposing a multi-level optimization approach, specifically focusing on array-level, row-level, and cell-level pruning techniques for energy-efficient weight mapping in IMC-based CNN inference.
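To make the under-utilisation problem concrete, the sketch below estimates how much of a fixed-size crossbar a convolutional layer actually occupies under a common im2col-style mapping (each flattened kernel stored in one column); the array dimensions and layer shapes are illustrative, not those used in the thesis.

    import math

    def crossbar_utilization(n_kernels, k, c_in, rows=256, cols=256):
        """Fraction of allocated crossbar cells that actually hold weights when each
        flattened k*k*c_in kernel is mapped onto crossbar columns (im2col-style)."""
        col_len = k * k * c_in                         # weights per kernel (one logical column)
        arrays_per_kernel = math.ceil(col_len / rows)  # tall kernels split across arrays
        cols_needed = n_kernels * arrays_per_kernel
        n_arrays = math.ceil(cols_needed / cols)
        used = n_kernels * col_len
        total = n_arrays * rows * cols
        return used / total

    # A small first layer wastes most of the array; a deep layer fits much better.
    print(f"{crossbar_utilization(64, 3, 3):.1%}")     # 3x3x3 kernels, 64 of them  -> ~2.6%
    print(f"{crossbar_utilization(256, 3, 128):.1%}")  # 3x3x128 kernels, 256 of them -> ~90%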
18:30 CEST PhDF.56 LEARN TO FLY: ENABLING DEEP LEARNING BASED PERCEPTION & CONTROL FOR AERIAL ROBOTS
Speaker and Author:
Veera Venkata Ram Murali Krishna Rao Muvva, University of Nebraska Lincoln, US
Abstract
Volants are extraordinary creatures. We can observe amazing phenomena in their flight, such as the precision of bald eagles in stormy conditions, the innate migration of Arctic terns without GNSS, the meticulous docking of hummingbirds, and the exceptional vision-based tracking of falcons. My career goal is to harness these extraordinary capabilities observed in volants, to overcome the current limitations in aerial systems. Volants achieve this intelligence not because they are experts in fluid dynamics, control systems, or computer vision; they succeed via a continuous learning process and the strong integration of perception and control modules. I believe that integrating traditional control theory and computer vision with machine and deep learning can offer solutions for designing and deploying robust aerial robots.
18:30 CEST PhDF.57 PERFORMANCE AND ENERGY EFFICIENT SECURE COMPUTING ON EDGE DEVICES
Speaker:
Ismet Dagli, Colorado School of Mines, US
Authors:
Ismet Dagli and Mehmet Belviranli, Colorado School of Mines, US
Abstract
As in-the-field computation demands increase, the use of more sophisticated heterogeneous System-on-Chips (SoCs) becomes more common in many edge devices. Advancing beyond monolithic single-processor architectures, SoCs have evolved to process a spectrum of computation through the integration of multiple domain-specific accelerators (DSAs). The architectural design choices, such as varying computation/power characteristics among these DSAs and CPUs, can further enhance the overall throughput and energy efficiency [17]. This approach enables collaborative execution, wherein tasks in a workload are dynamically mapped to the most efficient processing unit (PU) [15, 20, 10]. The total utilization of the system can be further improved by running tasks concurrently, e.g., layers in a deep neural network (DNN). Future computing systems are expected to scale both the number of accelerators, by embedding more processor diversity in a computing device [18, 16], and the number of computing devices, by connecting more devices and/or the cloud in domains such as federated learning and connected autonomous cars [14, 1]. Overall, efficient execution of modern edge/cloud workloads requires understanding their performance at both the SoC level and the system level, and should aim to improve four critical considerations: energy, latency, throughput, and security.
• Energy consumption is critical in autonomous and mobile domains, where the deployment of machine learning tasks, particularly DNNs, incurs significant power consumption. Our study, published in DAC'22, optimizes energy consumption and is detailed in Section 2.
• Minimizing computational latency is achievable by understanding performance bottlenecks at the SoC and system levels. Our work, published in PPoPP'24 and an SRC finalist at MICRO'22, optimizes latency and performance and is presented in Section 3.
• Security vulnerabilities arising from shared-memory attacks have become increasingly common as performance is optimized on resource-limited edge devices. Our work, accepted to DATE'25, investigates shared-memory vulnerabilities in Section 4.
• Improving the total system throughput necessitates an in-depth analysis of resource utilization at the SoC level and the system level. Our work, currently under submission to a top-tier conference and an SRC finalist at CGO'24, proposes holistic resource management in Section 5.
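As a minimal illustration of the collaborative-execution idea mentioned above (not the schedulers proposed in the cited works), the sketch below greedily maps each layer of a workload to the processing unit with the lowest energy-delay product, using a made-up per-layer profile table.

    # Hypothetical per-layer profiles: (latency_ms, energy_mJ) on each PU.
    profiles = {
        "conv1": {"CPU": (9.0, 40.0), "GPU": (2.0, 25.0), "NPU": (1.5, 8.0)},
        "lstm":  {"CPU": (6.0, 18.0), "GPU": (5.5, 45.0), "NPU": (7.0, 20.0)},
        "fc":    {"CPU": (1.0,  4.0), "GPU": (0.8,  6.0), "NPU": (0.6, 2.0)},
    }

    def greedy_mapping(profiles):
        """Pick, per layer, the PU minimising the energy-delay product (EDP)."""
        mapping = {}
        for layer, options in profiles.items():
            mapping[layer] = min(options, key=lambda pu: options[pu][0] * options[pu][1])
        return mapping

    print(greedy_mapping(profiles))   # -> {'conv1': 'NPU', 'lstm': 'CPU', 'fc': 'NPU'}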
18:30 CEST PhDF.58 ENHANCING QUANTUM CLOUD PERFORMANCE THROUGH ADVANCED TECHNIQUES
Presenter:
Tingting Li, Zhejiang University, CN
Authors:
Tingting Li1, Jianwei Yin2 and Liqiang Lu1
1Zhejiang University, CN; 2Zhejiang University, CN
Abstract
Quantum computing cloud services are composed of a complex ecosystem that integrates hardware, software, and network infrastructure to provide users with access to quantum computing resources over the internet. Quantum computing cloud services have emerged as a pivotal domain in the realm of computational technology, offering unprecedented computational capabilities that could revolutionize various industries. The concept refers to the provision of quantum computing resources over the cloud, allowing users to access and utilize quantum processors without the need for physical ownership. This thesis encompasses a series of optimizations from the hardware level to software services, aiming to enhance the efficiency and reliability of quantum cloud services. On the hardware side, we explore the use of Mixture of Experts (MoE) for the automatic calibration of superconducting quantum computers. On the software side, we investigate quantum serverless function orchestration for task allocation optimization. In terms of cloud service security, we explore quantum fingerprinting for cloud security using quantum task output. These efforts collectively contribute to the advancement of quantum cloud services, ensuring their robustness and security in the face of evolving computational demands.
18:30 CEST PhDF.59 EFFICIENT REFINEMENT OF HUMAN POSE ESTIMATION FOR INDUSTRY 5.0
Speaker:
Enrico Martini, Università di Verona, IT
Authors:
Enrico Martini and Nicola Bombieri, Università di Verona, IT
Abstract
This thesis addresses challenges in markerless Human Pose Estimation (HPE), including noise, occlusions, and computational constraints, by developing real-time filtering techniques that combine learned models with traditional methods. Key contributions include BeFine, a distributed 3D HPE industrial telemonitoring system that uses edge devices to capture multi-view poses and applies advanced filtering and clustering algorithms. In human-robot interaction, a filtering pipeline improves incomplete 3D poses from RGB-D cameras, mitigating occlusion effects and enabling collision prediction. This work enhances markerless motion capture, demonstrating its value in real-world applications.
18:30 CEST PhDF.60 LATTICE-BASED CRYPTOGRAPHY: BEYOND NIST STANDARDIZATION
Presenter:
Suparna Kundu, COSIC, KU Leuven, BE
Authors:
Suparna Kundu1, Ingrid Verbauwhede2 and Angshuman Karmakar3
1COSIC, KU Leuven, BE; 2KU Leuven, BE; 3IIT Kanpur, IN
Abstract
The National Institute of Standards and Technology (NIST) published the first set of post-quantum cryptographic standards in 2024. Although this is a significant step towards the transition from classical public-key cryptography (PKC) to post-quantum cryptography (PQC), several issues, such as new designs, lightweight implementations, physical attacks, and their countermeasures, need to be addressed before the widespread deployment of PQC in real-world applications. The primary focus of my thesis was to address some of these problems. My thesis bridged the gap between the theory and practice of PQC, especially lattice-based key-encapsulation mechanisms.

REC Reception

Add this session to my calendar

Date: Monday, 31 March 2025
Time: 18:30 CEST - 20:00 CEST


Tuesday, 01 April 2025

ASD04 ASD regular session: Novel Safety Metrics, Adaptive Patterns for Resilience, and Legal Frameworks in Autonomous Systems Design

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

Session chair:
Dirk Ziegenbein, Robert Bosch GmbH, DE

Session co-chair:
Rolf Ernst, TU Braunschweig, DE

This session discusses key design aspects for safety, adaptability, and legality of autonomous systems. First, a framework that utilizes cross-channel safety performance indicators (SPIs) to identify and tackle hazardous driving scenarios for automated vehicles and corresponding evidence from a proof-of-concept implementation is presented. The session continues with the introduction of the Reflex pattern, an innovative approach inspired by biological reflexes that enhances system resilience by dynamically responding to fluctuating resources, demonstrated through a drone image processing scenario. Lastly, the role of legal considerations in the design of automated vehicles is explored, especially those intended to transport intoxicated individuals, underscoring the need for a multidisciplinary collaboration among management, marketing, engineering, and legal teams to ensure the development of functionally robust and legally sound systems.

Time Label Presentation Title
Authors
08:30 CEST ASD04.1 IDENTIFICATION OF HAZARDOUS DRIVING SCENARIOS USING CROSS-CHANNEL SAFETY PERFORMANCE INDICATORS
Speaker:
Caspar Hanselaar, Eindhoven University of Technology, NL
Authors:
Caspar Hanselaar1, Murali Manohar Selva Kumar2, Yuting Fu2, Andrei Terechko2, Ranga Rao Venkatesha Prasad3 and Emilia Silvas1
1Eindhoven University of Technology, NL; 2NXP, NL; 3TU Delft, NL
Abstract
Automated Driving (AD) vehicles are slowly being deployed on public roads. These AD vehicles will encounter hazardous (dangerous) scenarios due to unforeseen test cases at design time and changing environments on the road after deployment. To allow developers of AD systems to mitigate such unforeseen risks, the safety of AD vehicles needs to be continuously monitored after deployment. To this end, the UL4600 standard and AVSC guidelines recommend the use of safety performance indicators (SPIs) by AD vehicle developers. Our paper presents a framework which uses SPIs to identify potentially hazardous scenarios specific to the evaluated AD system, covering both AD vehicles and cloud operations. The framework uses the perception systems and motion plans of heterogeneous redundant multi-channel architectures to detect hazards invisible in single-channel-based systems. We propose three cross-channel SPIs and use them to identify hazardous scenarios in the AD vehicle and validate this approach with a proof-of-concept implementation. In a test of six challenging routes in the CARLA simulator, our framework automatically identifies 86% of hazardous situations. Next, it identifies contributing issues in the AD vehicle, such as missed object detections or dangerous planned trajectories. With this proof of concept, we show that this framework provides evidence on the safety of deployed systems, identifies AD vehicle functions in need of improvement and provides lessons for the development of future AD systems.
09:00 CEST ASD04.2 DESIGNING RESILIENT AUTONOMOUS SYSTEMS WITH THE REFLEX PATTERN
Speaker:
Julian Demicoli, TU Munich, DE
Authors:
Julian Demicoli and Sebastian Steinhorst, TU Munich, DE
Abstract
Autonomous systems face significant challenges due to fluctuating resources and unstable environments, where traditional redundancy strategies for resilience can be inefficient. We present the Reflex pattern, inspired by biological reflexes, promoting system resilience by dynamically adapting to changing resource conditions. By switching between complex and resource-efficient algorithms based on availability, the pattern optimizes resource utilization without extensive redundancy, ensuring essential functionalities remain operational under constraints. To facilitate adoption, we introduce ReflexLang, a domain-specific language (DSL) enabling automated code generation for reflex-pattern-based systems. We validate the pattern's effectiveness in a drone image processing scenario, demonstrating its potential to enhance operational integrity and resilience.
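A minimal sketch of the switching behaviour the Reflex pattern describes (not ReflexLang or its generated code): a heavyweight and a lightweight routine are registered for the same task, and the runtime picks one from the currently observed resource budget. The routine names and thresholds below are invented for illustration.

    def detect_full(frame):
        # accurate but expensive perception pipeline (placeholder)
        return "full-detection-result"

    def detect_lite(frame):
        # cheap fallback that keeps the essential function alive (placeholder)
        return "lite-detection-result"

    def reflex_step(frame, free_cpu, free_mem_mb, cpu_thresh=0.4, mem_thresh=256):
        """Select an algorithm from the currently observed resource budget,
        instead of provisioning full redundancy for the worst case."""
        if free_cpu >= cpu_thresh and free_mem_mb >= mem_thresh:
            return detect_full(frame)
        return detect_lite(frame)   # reflex: degrade gracefully, stay operational

    print(reflex_step(frame=None, free_cpu=0.2, free_mem_mb=512))   # -> lite path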
09:30 CEST ASD04.3 LAW AS A DESIGN CONSIDERATION FOR AUTOMATED VEHICLES SUITABLE TO TRANSPORT INTOXICATED PERSONS
Speaker:
William Widen, University of Miami, US
Authors:
Marilyn Wolf1 and William Widen2
1University of Nebraska, US; 2University of Miami, US
Abstract
This essay explains why an automated vehicle (AV) manufacturer should consider law during the design process for an AV intended as "fit-for-purpose" to transport intoxicated persons. It suggests that management, marketing, engineering, and legal functions collaborate to develop product requirements and specifications that shield owner/occupants from criminal liability for DUI manslaughter and negligent homicide, as well as guard against civil liability.

BPA04 BPA Session 4

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST BPA04.1 A LIGHTWEIGHT CNN FOR REAL-TIME PRE-IMPACT FALL DETECTION
Speaker:
Cristian Turetta, Università di Verona, IT
Authors:
Cristian Turetta1, Muhammed Ali1, Florenc Demrozi2 and Graziano Pravadelli1
1Università di Verona, IT; 2Department of Electrical Engineering and Computer Science, University of Stavanger, NO
Abstract
Falls can have significant and far-reaching effects on various groups, particularly the elderly, workers, and the general population. These effects can impact both physical and psychological well-being, leading to long-term health problems, reduced productivity, and a decreased quality of life. Numerous fall detection systems have been developed to prompt first aid in the event of a fall and reduce its impact on people's lives. However, detecting a fall after it has occurred is insufficient to mitigate its consequences, such as trauma. These effects can be further minimized by activating safety systems (e.g., wearable airbags) during the fall itself, specifically in the pre-impact phase, to reduce the severity of the impact when hitting the ground. Achieving this, however, requires recognizing the fall early enough to provide the necessary time for the safety system to become fully operational before impact. To address this challenge, this paper introduces a novel lightweight convolutional neural network (CNN) designed to detect pre-impact falls. The proposed model overcomes the limitations of current solutions regarding deployability on resource-constrained embedded devices, specifically for controlling the inflation of an airbag jacket. We extensively tested and compared our model, deployed on an STM32F722 microcontroller, against state-of-the-art approaches using two different datasets.
08:50 CEST BPA04.2 COCKTAIL: CHUNK-ADAPTIVE MIXED-PRECISION QUANTIZATION FOR LONG-CONTEXT LLM INFERENCE
Speaker:
Wei Tao, Huazhong University of Science and Technology, CN
Authors:
Wei Tao1, Bin Zhang1, Xiaoyang Qu2, Jiguang Wan1 and Jianzong Wang3
1Huazhong University of Science and Technology, CN; 2Ping An Technology (Shenzhen) Co., Ltd, CN; 3Ping An Technology, CN
Abstract
Recently, large language models (LLMs) have been able to handle longer and longer contexts. However, a context that is too long may cause intolerable inference latency and GPU memory usage. Existing methods propose mixed-precision quantization to the key-value (KV) cache in LLMs based on token granularity, which is time-consuming in the search process and hardware inefficient during computation. This paper introduces a novel approach called Cocktail, which employs chunk-adaptive mixed-precision quantization to optimize the KV cache. Cocktail consists of two modules: chunk-level quantization search and chunk-level KV cache computation. Chunk-level quantization search determines the optimal bitwidth configuration of the KV cache chunks quickly based on the similarity scores between the corresponding context chunks and the query, maintaining the model accuracy. Furthermore, chunk-level KV cache computation reorders the KV cache chunks before quantization, avoiding the hardware inefficiency caused by mixed-precision quantization in inference computation. Extensive experiments demonstrate that Cocktail outperforms state-of-the-art KV cache quantization methods on various models and datasets. Our code is available at https://github.com/Sullivan12138/Cocktail.
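A rough sketch of the chunk-level bitwidth selection idea described above (illustrative thresholds and scoring, not the actual Cocktail search): chunks whose context is more similar to the current query keep higher precision.

    import numpy as np

    def chunk_bitwidths(chunk_embs, query_emb, hi_bits=8, lo_bits=2, keep_top=0.25):
        """Assign a KV-cache bitwidth per chunk from its similarity to the query.
        The top `keep_top` fraction of chunks keeps hi_bits; the rest gets lo_bits."""
        sims = chunk_embs @ query_emb / (
            np.linalg.norm(chunk_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
        n_hi = max(1, int(len(sims) * keep_top))
        hi_idx = np.argsort(sims)[-n_hi:]
        bits = np.full(len(sims), lo_bits)
        bits[hi_idx] = hi_bits
        return bits

    # Example: 8 context chunks with 64-dimensional summary embeddings.
    rng = np.random.default_rng(0)
    print(chunk_bitwidths(rng.normal(size=(8, 64)), rng.normal(size=64)))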
09:10 CEST BPA04.3 RANKMAP: PRIORITY-AWARE MULTI-DNN MANAGER FOR HETEROGENEOUS EMBEDDED DEVICES
Speaker:
Iraklis Anagnostopoulos, Southern Illinois University Carbondale, US
Authors:
Andreas Karatzas1, Dimitrios Stamoulis2 and Iraklis Anagnostopoulos1
1Southern Illinois University Carbondale, US; 2The University of Texas at Austin, US
Abstract
Modern edge data centers simultaneously handle multiple Deep Neural Networks (DNNs), leading to significant challenges in workload management. Thus, current management systems need to leverage the architectural heterogeneity of new embedded systems, enabling efficient handling of multi-DNN workloads. This paper introduces RankMap, a priority-aware manager specifically designed for multi-DNN tasks on heterogeneous embedded devices. RankMap addresses the extensive solution space of multi-DNN mapping through stochastic space exploration combined with a performance estimator. Experimental results show that RankMap achieves 3.6× higher average throughput compared to existing methods, while effectively preventing DNN starvation under heavy workloads and improving the prioritization of specified DNNs by 57.5×.

ET02 Securing the Future: Designing Built-in-Security Enabled Photonic AI Chip

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST


FS05 Focus Session - 3D Integration, Cryogenic Circuits and Superconducting Logic: Emerging Trends Shaping the Future of High-Performance Computing

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

Session chair:
Ahmedullah Aziz, University of Tennessee Knoxville, US

Session co-chair:
Hussam Amrouch, TU Munich, DE

Organiser:
Hussam Amrouch, TU Munich, DE

As CMOS scaling approaches its fundamental limits, the explosive rise of artificial intelligence (AI) and large language models (LLMs) is exposing profound challenges in today’s computing architectures. The immense demand for memory, speed, and energy efficiency is pushing classical chips to their breaking point. This focus session will explore three transformative trends that are poised to redefine the future of high-performance computing and address the escalating challenges of AI-driven workloads. The first trend is 3D integration, an innovative paradigm that allows memory layers to be fabricated in the back end of line (BEOL), dramatically increasing on-chip memory capacity. Of the emerging memory technologies, ferroelectric memories stand out as particularly promising due to their compatibility with BEOL CMOS and their low-power, high-density operation. The second trend, cryogenic CMOS, leverages the advantages of operating circuits at cryogenic temperatures (77K and below), significantly enhancing transistor performance with steeper subthreshold slopes, higher on-currents, and lower off-currents, delivering remarkable speed and efficiency gains. The last trend is superconducting logic, which is set to revolutionize computing by achieving zero resistance, unlocking unprecedented levels of speed and energy efficiency. Our session brings together leading experts from both industry and academia to present cutting-edge solutions that are already reshaping the semiconductor landscape. Attendees will gain a deep, comprehensive understanding of the emerging trends that will drive the next generation of high-performance computing, providing a critical window into the chips that will fuel future advances in AI.

Time Label Presentation Title
Authors
08:30 CEST FS05.1 PUSHING THE BOUNDARIES OF AI CHIPS: FROM MONOLITHIC 3D CMOS TO CRYOGENIC COMPUTING
Speaker:
Hussam Amrouch, TU Munich (TUM), DE
Authors:
Mahdi Benkhelifa1, Shivendra Parihar2, Anirban Kar1, Girish Pahwa3, Yogesh Chauhan4 and Hussam Amrouch5
1TU Munich, DE; 2University of California, Berkeley, US; 3National Yang Ming Chiao Tung University, TW; 4IIT Kanpur, IN; 5TU Munich (TUM), DE
Abstract
As CMOS scaling approaches its fundamental limits, the explosive rise of AI and LLMs has unveiled profound bottlenecks in computing architectures. This talk presents two groundbreaking paradigms poised to reshape the landscape of high-performance computing and meet the surging demands of AI-driven workloads. The first paradigm is 3D monolithic integration, a revolutionary approach that achieves unprecedented logic density through Complementary FETs (CFETs), where pMOS and nMOS transistors are vertically stacked, and a dramatic expansion of on-chip memory capacity by integrating memory layers atop logic transistors. The second paradigm leverages the transformative potential of operating chips at cryogenic temperatures—specifically around 77K—where transistors exhibit significantly enhanced performance, and parasitic resistances are substantially minimized. These advancements hold the promise of redefining computing efficiency and performance for the AI era.
08:53 CEST FS05.2 TRANSISTOR AGING AND CIRCUIT RELIABILITY AT CRYOGENIC TEMPERATURES
Speaker:
Vishal Nayar, imec, BE
Authors:
Javier Fortuny and Vishal Nayar, IMEC, BE
Abstract
The increasing interest in cryogenic circuits is driven by their transformative potential across high-performance computing, medical devices, space exploration, and quantum technologies. Operating transistors at cryogenic temperatures, such as 77 K and below, yields substantial improvements, including increased ON current, reduced OFF current, and enhanced sub-threshold slope. While recent studies have explored device-level reliability at cryogenic temperatures, circuit-level reliability—particularly under bias temperature instability (BTI)—remains underexamined, leaving critical aging mechanisms at these temperatures not well understood. To bridge this gap, we designed and fabricated a customized chip in a commercial HKMG 28 nm technology. The chip integrates several ring oscillator (RO) circuits for precise characterization of accelerated aging effects, enabling evaluation of their impact on performance at cryogenic temperatures. Finally, we project technology degradation over a 10-year horizon, comparing the wear-out achieved at room temperature (298 K) and at 77 K when operating circuits at the nominal voltage, revealing the significant mitigation of BTI aging at affordable cryogenic temperatures.
09:15 CEST FS05.3 FERROELECTRIC-SUPERCONDUCTING SYNERGY FOR FUTURE COMPUTING
Speaker:
Ahmedullah Aziz, University of Tennessee Knoxville, US
Authors:
Shamiul Alam1 and Ahmedullah Aziz2
1University of Tennessee Knoxville, US; 2University of Tennessee, Knoxville, US
Abstract
Ferroelectric Superconducting Quantum Interference Devices (Fe-SQUIDs) have recently gained attention as a transformative technology for superconducting computing, offering voltage-controlled switching that is essential for large-scale digital circuits. This unique technology has the potential to drive advancements in cryogenic computing by enabling scalable memory systems and voltage-controlled logic circuits. These innovations are critical for the realization of large-scale quantum computers and hold significant promise for high-performance computing and space exploration. In this article, we explore how Fe-SQUIDs, integrated with heater cryotrons (hTrons), can be harnessed to develop key components of computing systems. These include non-volatile memory, voltage-controlled logic circuits, in-memory matrix-vector multiplication systems, and ternary content-addressable memory. We also examine how changes in the key characteristics of Fe-SQUIDs and hTrons influence the performance of these applications, providing insights into the design and optimization of next-generation superconducting hardware.
09:38 CEST FS05.4 MATERIAL-TO-SYSTEM CO-OPTIMIZATION FOR ADVANCED SEMICONDUCTOR MANUFACTURING
Presenter:
Gaurav Thareja, Applied Materials, US
Author:
Gaurav Thareja, Applied Materials, US
Abstract
The exponential growth of AI is tied to groundbreaking advancements in semiconductor technology, driven by the PPACt metrics: low Power, high Performance, reduced Area, low Cost, and faster Time to market. Traditionally, achieving these metrics has required years—often decades—of meticulous semiconductor innovation, progressing from concept to high-volume manufacturing. This process unfolds through four critical phases: materials discovery, process optimization, device engineering, and chip design. In this talk, we will explore how ML-driven methods are revolutionizing the semiconductor industry, significantly accelerating progress across all stages of development. We will highlight the key discoveries necessary for enabling novel materials that power cryogenic circuits, ferroelectric memories, and 3D integration.

HSD01 HackTheSilicon DATE

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 12:30 CEST


SD01 Special Day on AI and ML Trends

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

This Special Day focuses on exploring the latest trends and innovations in Artificial Intelligence (AI) and Machine Learning (ML) in the context of DATE. As AI (and mainly generative AI) is booming, especially since the release of ChatGPT, we expect AI/ML to change the way we approach Design, Automation, and Test. In this context, field experts will present their thoughts on the challenges and opportunities of AI/ML, and will engage the audience in an open discussion about the trends that the DATE community should pursue.

This Special Day will highlight the following topics:
* Design of hardware architectures and software, including automatic exploration of large design spaces, assistance of the human designer, resource selection and optimization
* Verification of hardware architectures, with topics such as performance prediction, (formal) design validation, accelerating simulations thanks to AI-Augmented Surrogate Models
* AI-Accelerated Physical Design and Validation of layout and floorplans
* New AI accelerators architectures

These topics will be addressed by a lineup of six distinguished speakers, experts in their respective fields. The day will conclude with a panel discussion allowing experts and the audience to engage in an informal exchange of ideas and trigger discussions on the future research directions and/or the interaction between the various domains presented during the day.

This Special Day is the ideal event for AI/ML researchers, data scientists, hardware designers, software developers, sustainability advocates, and anyone interested in the future directions of AI and ML for Design, Automation and Test.

Time Label Presentation Title
Authors
08:30 CEST SD01.1 INTRODUCTION TO THE SPECIAL DAY
Presenter:
Ana Lucia Varbanescu, University of Amsterdam, NL
Authors:
Ana Lucia Varbanescu1 and Marc Duranton2
1University of Amsterdam, NL; 2CEA, FR
Abstract
.
08:45 CEST SD01.2 TBD
Presenter:
David Z. Pan, The University of Texas at Austin, US
Author:
David Z. Pan, The University of Texas at Austin, US
Abstract
.
09:15 CEST SD01.3 TBD
Presenter:
Tobias Becker, Groq, DE
Author:
Tobias Becker, Groq, DE
Abstract
.
09:45 CEST SD01.4 TBD
Presenter:
Adolfy Hoisie, Brookhaven National Laboratory, US
Author:
Adolfy Hoisie, Brookhaven National Laboratory, US
Abstract
.

TS06 Design Automation for Quantum Computing

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST TS06.1 EMPOWERING QUANTUM ERROR TRACEABILITY WITH MOE FOR AUTOMATIC CALIBRATION
Speaker:
Tingting Li, Zhejiang University, CN
Authors:
Tingting Li1, Ziming Zhao1, Liqiang Lu1, Siwei Tan2 and Jianwei Yin1
1Zhejiang University, CN; 2Zhejiang University, CN
Abstract
Quantum computing offers the potential for exponential speedups over classical computing in tackling complex tasks, such as large-number factorization and chemical molecular simulation. However, quantum noise remains a significant challenge, hindering the reliability and scalability of quantum systems. Therefore, effective characterization and calibration of quantum noise are critical to advancing these systems. Quantum calibration is a process that heavily relies on expert knowledge, and a range of current research focuses on automatic calibration. However, traditional calibration methods often lack an effective error-traceback mechanism, leading to repeated calibration attempts without identifying root causes. To address the issue of error traceback in calibration failures, this paper proposes an automatic calibration error traceback algorithm facilitated by a Mixture of Experts (MoE) system inspired by current large language model technologies. Our approach enables traceability of quantum calibration errors, allowing for the rapid identification and correction of deviations from the calibration state. Extensive experimental results demonstrate that the MoE-based automatic calibration method significantly outperforms traditional techniques in error traceability and calibration efficiency. Notably, our approach improved the average visibility of 77 qubits by 25.5%, surpassing the outcomes of fixed calibration processes. This work presents a promising path toward more reliable and scalable quantum computing systems.
08:35 CEST TS06.2 OPTIMAL STATE PREPARATION FOR LOGICAL ARRAYS ON ZONED NEUTRAL ATOM QUANTUM COMPUTERS
Speaker:
Yannick Stade, TU Munich, DE
Authors:
Yannick Stade, Ludwig Schmid, Lukas Burgholzer and Robert Wille, TU Munich, DE
Abstract
Quantum computing promises to solve problems previously deemed infeasible. However, high error rates necessitate quantum error correction for practical applications. Seminal experiments with zoned neutral atom architectures have shown remarkable potential for fault-tolerant quantum computing. To fully harness their potential, efficient software solutions are vital. A key aspect of quantum error correction is the initialization of physical qubits representing a logical qubit in a highly entangled state. This process, known as state preparation, is the foundation of most quantum error correction codes and, hence, a crucial step towards fault-tolerant quantum computing. Generating a schedule of target-specific instructions to perform the state preparation is highly complex. First software tools exist but are not suitable for the zoned neutral atom architectures. This work addresses this gap by leveraging the computational power of SMT solvers and generating minimal schedules for the state preparation of logical arrays. Experimental evaluations demonstrate that actively utilizing zones to shield idling qubits consistently results in higher fidelities than solutions disregarding these zones. The complete code is publicly available in open-source as part of the Munich Quantum Toolkit (MQT) at https://github.com/cda-tum/mqt-qmap.
08:40 CEST TS06.3 DESIGN OF AN FPGA-BASED NEUTRAL ATOM REARRANGEMENT ACCELERATOR FOR QUANTUM COMPUTING
Speaker:
Xiaorang Guo, TU Munich, DE
Authors:
Xiaorang Guo, Jonas Winklmann, Dirk Stober, Amr Elsharkawy and Martin Schulz, TU Munich, DE
Abstract
Neutral atoms have emerged as a promising technology for implementing quantum computers due to their scalability and long coherence times. However, the execution frequency of neutral atom quantum computers is constrained by image processing procedures, particularly the assembly of defect-free atom arrays, which is a crucial step in preparing qubits (atoms) for execution. To optimize this assembly process, we propose a novel quadrant-based rearrangement algorithm that employs a divide-and-conquer strategy and also enables the simultaneous movement of multiple atoms, even across different columns and rows. We implement the algorithm on Field Programmable Gate Arrays (FPGAs) to handle each quadrant independently (hardware-level optimization) while maximizing parallelization. To the best of our knowledge, this is the first hardware acceleration work for atom rearrangement, and it significantly reduces processing time. This achievement also contributes to the ongoing efforts of tightly integrating quantum accelerators into High-Performance Computing (HPC) systems. Tested on a Zynq RFSoC FPGA at 250 MHz, our hardware implementation is able to complete the rearrangement process of a 30×30 compact target array, derived from a 50×50 initial loaded array, in approximately 1.0 μs. Compared to a comparable CPU implementation and to state-of-the-art FPGA work, we achieved about 54× and 300× speedups in the rearrangement analysis time, respectively. Additionally, the FPGA-based acceleration demonstrates good scalability, allowing for seamless adaptation to varying sizes of the atom array, which makes this algorithm a promising solution for large-scale quantum systems.
08:45 CEST TS06.4 IMAGE COMPUTATION FOR QUANTUM TRANSITION SYSTEMS
Speaker:
Xin Hong, Institute of Software, Chinese Academy of Sciences, CN
Authors:
Xin Hong1, Dingchao Gao1, Sanjiang Li2, Shenggang Ying1 and Mingsheng Ying3
1Institute of Software, Chinese Academy of Sciences, CN; 2UTS, AU; 3University of Technology Sydney, AU
Abstract
With the rapid progress in quantum hardware and software, the need for verification of quantum systems becomes increasingly crucial. While model checking is a dominant and very successful technique for verifying classical systems, its application to quantum systems is still an underdeveloped research area. This paper advances the development of model checking quantum systems by providing efficient image computation algorithms for quantum transition systems, which play a fundamental role in model checking. In our approach, we represent quantum circuits as tensor networks and design algorithms by leveraging the properties of tensor networks and tensor decision diagrams. Our experiments demonstrate that our contraction partition-based algorithm can greatly improve the efficiency of image computation for quantum transition systems.
08:50 CEST TS06.5 LOW-LATENCY DIGITAL FEEDBACK FOR STOCHASTIC QUANTUM CALIBRATION USING CRYOGENIC CMOS
Speaker:
Nathan Miller, Georgia Tech, US
Authors:
Nathan Miller, Laith Shamieh and Saibal Mukhopadhyay, Georgia Tech, US
Abstract
In order to develop quantum computing systems towards practically useful applications, their physical quantum bits (qubits) must be able to operate with minimal error. Recent work has demonstrated stochastic gate calibration protocols for quantum systems which are meant to track drifting control parameters and tune gate operations to high fidelity. These protocols critically rely on low-latency feedback between the quantum system and its classical control hardware, which is impossible without on-board classical compute from FPGAs or ASICs. In this work, we analyze the performance of a single-shot stochastic calibration protocol for indefinite outcome quantum circuits under various latency conditions based on timing considerations from experimental quantum systems. We also demonstrate the benefits that can be achieved with ASIC implementation of the protocol by synthesizing the classical control logic in a 28 nm CMOS design node, with simulations extended to 14 nm FinFET and at both room and cryogenic temperatures. We show that these classes of quantum calibration protocols can be easily implemented within contemporary control system architectures for low-latency performance without significant power or resource utilization, allowing for the rapid tuning and drift control of any gate-model quantum system towards fault-tolerant computation.
08:55 CEST TS06.6 IMPROVING FIGURES OF MERIT FOR QUANTUM CIRCUIT COMPILATION
Speaker:
Patrick Hopf, TU Munich, DE
Authors:
Patrick Hopf1, Nils Quetschlich1, Laura Schulz2 and Robert Wille1
1TU Munich, DE; 2Leibniz Supercomputing Centre, DE
Abstract
Quantum computing is an emerging technology that has seen significant software and hardware improvements in recent years. Executing a quantum program requires the compilation of its quantum circuit for a target Quantum Processing Unit (QPU). Various methods for qubit mapping, gate synthesis, and optimization of quantum circuits have been proposed and implemented in compilers. These compilers try to generate a quantum circuit that leads to the best execution quality - a criterion that is usually approximated by figures of merit such as the number of (two-qubit) gates, the circuit depth, expected fidelity, or estimated success probability. However, it is often unclear how well these figures of merit represent the actual execution quality on a QPU. In this work, we investigate the correlation between established figures of merit and actual execution quality on real machines - revealing that the correlation is weaker than anticipated and that more complex figures of merit are not necessarily more accurate. Motivated by this finding, we propose an improved figure of merit (based on a machine learning approach) that can be used to predict the expected execution quality of a quantum circuit for a chosen QPU without actually executing it. The employed machine learning model reveals the influence of various circuit features on generating high correlation scores. The proposed figure of merit demonstrates a strong correlation and outperforms all previous ones in a case study - achieving an average correlation improvement of 49%.
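For reference, one of the established figures of merit mentioned above, the estimated success probability (ESP), is commonly computed from the error rates reported in a QPU's calibration data, roughly as

    \mathrm{ESP} \approx \prod_{g \in \text{gates}} \bigl(1 - \epsilon_g\bigr) \cdot \prod_{q \in \text{measured qubits}} \bigl(1 - \epsilon_q^{\mathrm{ro}}\bigr)

where \epsilon_g is the reported error rate of gate g and \epsilon_q^{\mathrm{ro}} the readout error of qubit q; the paper's finding is that such calibration-data products track measured execution quality less closely than expected.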
09:00 CEST TS06.7 DETERMINISTIC FAULT-TOLERANT STATE PREPARATION FOR NEAR-TERM QUANTUM ERROR CORRECTION: AUTOMATIC SYNTHESIS USING BOOLEAN SATISFIABILITY
Speaker:
Ludwig Schmid, TU Munich, DE
Authors:
Ludwig Schmid1, Tom Peham1, Lucas Berent1, Markus Müller2 and Robert Wille1
1TU Munich, DE; 2RWTH Aachen University, DE
Abstract
To ensure resilience against the unavoidable noise in quantum computers, quantum information needs to be encoded using an error-correcting code, and circuits must have a particular structure to be fault-tolerant. Compilation of fault-tolerant quantum circuits is thus inherently different from the non-fault-tolerant case. However, automated fault-tolerant compilation methods are widely underexplored, and most known constructions are obtained manually for specific codes only. In this work, we focus on the problem of automatically synthesizing fault-tolerant circuits for the deterministic initialization of an encoded state for a broad class of quantum codes that are realizable on current and near-term hardware. To this end, we utilize methods based on techniques from classical circuit design, such as satisfiability solving, resulting in tools for the synthesis of (optimal) fault-tolerant state preparation circuits for near-term quantum codes. We demonstrate the correct fault-tolerant behavior of the synthesized circuits using circuit-level noise simulations. We provide all routines as open-source software as part of [retracted for double-blind review] for general use and to foster research in fault-tolerant circuit synthesis.
09:05 CEST TS06.8 OPTIMIZING QUBIT ASSIGNMENT IN MODULAR QUANTUM SYSTEMS VIA ATTENTION-BASED DEEP REINFORCEMENT LEARNING
Speaker:
Enrico Russo, University of Catania, IT
Authors:
Enrico Russo, Maurizio Palesi, Davide Patti, Giuseppe Ascia and Vincenzo Catania, University of Catania, IT
Abstract
Modular, distributed, and multi-core architectures are considered a promising solution for scaling quantum computing systems. Optimising communication is crucial to preserve quantum coherence. The compilation and mapping of quantum circuits should minimise state transfers while adhering to architectural constraints. To address this problem efficiently, we propose a novel approach using Reinforcement Learning (RL) to learn heuristics for a specific multi-core architecture. Our RL agent uses a Transformer encoder and Graph Neural Networks, encoding quantum circuits with self-attention and producing outputs via an attention-based pointer mechanism to match logical qubits with physical cores efficiently. Experimental results show that our method outperforms the baseline, reducing inter-core communications by 28% for random circuits while minimising time-to-solution.
09:10 CEST TS06.9 NEURAL CIRCUIT PARAMETER PREDICTION FOR EFFICIENT QUANTUM DATA LOADING
Speaker:
Dohun Kim, Pohang University of Science and Technology, KR
Authors:
Dohun Kim, Sunghye Park and Seokhyeong Kang, Pohang University of Science and Technology, KR
Abstract
Quantum machine learning (QML) has demonstrated the potential to outperform classical machine learning algorithms in various fields. However, encoding classical data into quantum states, known as quantum data loading, remains a challenge. Existing methods achieve high accuracy in loading a single data item, but lack efficiency for large-scale data loading tasks. In this work, we propose Neural Circuit Parameter Prediction, a novel method that leverages classical deep neural networks to predict the parameters of parameterized quantum circuits directly from the input data. This approach benefits from the batch inference capability of neural networks and improves the accuracy of quantum data loading. We introduce real-valued parameterization of quantum circuits and a three-phase training strategy to further enhance training efficiency and accuracy. Experimental results on the MNIST dataset show that our method achieves a 17.31% improvement in infidelity score and a 108-times faster runtime compared to existing methods. Our approach provides an efficient solution for quantum data loading, enabling the practical deployment of QML algorithms on large-scale datasets.
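The infidelity score referenced above is, in its usual definition, one minus the overlap between the target state and the state produced by the parameterized circuit U(\theta) acting on the all-zero state:

    \text{infidelity}(\theta) \;=\; 1 - \bigl|\langle \psi_{\text{target}} \,\vert\, U(\theta) \,\vert 0 \rangle^{\otimes n} \bigr|^2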
09:15 CEST TS06.10 CIM-BASED PARALLEL FULLY FFNN SURFACE CODE HIGH-LEVEL DECODER FOR QUANTUM ERROR CORRECTION
Speaker:
Hao Wang, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Hao Wang1, Erjia Xiao1, Songhuan He2, Zhongyi Ni1, Lingfeng Zhang1, Xiaokun Zhan3, Yifei Cui2, Jinguo Liu1, Cheng Wang2, Zhongrui Wang4 and Renjing Xu1
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2University of Electronic Science and Technology of China, CN; 3Harbin Institute of Technology, CN; 4Southern University of Science and Technology, CN
Abstract
Among all types of surface code decoders, fully neural network-based high-level decoders offer decoding thresholds that surpass the Minimum Weight Perfect Matching (MWPM) decoder, and exhibit strong scalability, making them one of the ideal solutions for addressing surface code challenges. However, current fully neural network-based high-level decoders can only operate serially and do not meet the current latency requirements (below 440 ns). To address these challenges, we first propose a parallel fully feedforward neural network (FFNN) high-level surface code decoder, and comprehensively measure its decoding performance on a computing-in-memory (CIM) hardware simulation platform. With the currently available hardware specifications, our work achieves a decoding threshold of 14.22%, and achieves high pseudo-thresholds of 10.4%, 11.3%, 12%, and 11.6% with decoding latencies of 197.03 ns, 234.87 ns, 243.73 ns, and 251.65 ns for distances of 3, 5, 7 and 9, respectively.

TS07 Applications of emerging technologies

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST TS07.1 HYPERDYN: DYNAMIC DIMENSIONAL MASKING FOR EFFICIENT HYPER-DIMENSIONAL COMPUTING
Speaker:
Fangxin Liu, Shanghai Jiao Tong University, CN
Authors:
Fangxin Liu, Haomin Li, Zongwu Wang, Dongxu Lyu and Li Jiang, Shanghai Jiao Tong University, CN
Abstract
Hyper-dimensional computing (HDC) is a bio-inspired computing paradigm that mimics cognitive tasks by encoding data into high-dimensional vectors and employing non-complex learning techniques. However, existing HDC solutions face a major challenge hindering their deployment on low-power embedded devices: the costly associative search module, especially in high-precision computations. This module involves calculating the distance between class vectors and query vectors, as well as sorting the distances. In this paper, we present HyperDyn, an efficient dynamic inference framework designed for accurate and efficient hyper-dimensional computing. Our framework first performs an offline analysis of the importance of different dimensions in the associative memory, based on the contributions of the dimensions to the classification accuracy. In addition, we introduce a dynamic dimensional importance scaling mechanism for more flexible and accurate dimension contribution judgments. Finally, HyperDyn achieves efficient dynamic associative search through a dimension masking mechanism that adapts to the characteristics of the input sample. We evaluate HyperDyn on datasets from three different fields, and the results show that HyperDyn can achieve a 7.65× speedup and 58% energy savings, with less than 0.2% loss in accuracy.
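A toy version of the associative search that HyperDyn accelerates, with a static random mask standing in for the paper's input-adaptive dimension masking; the vector sizes, the bipolar encoding, and the masking rule are illustrative only.

    import numpy as np

    def masked_associative_search(class_vecs, query, mask):
        """Return the class whose hypervector is closest to the query,
        computing cosine similarity only over the unmasked dimensions."""
        cv = class_vecs[:, mask]
        q = query[mask]
        sims = cv @ q / (np.linalg.norm(cv, axis=1) * np.linalg.norm(q) + 1e-9)
        return int(np.argmax(sims))

    D = 10_000                                     # hyper-dimensional vector length
    rng = np.random.default_rng(1)
    classes = rng.choice([-1, 1], size=(10, D))    # 10 bipolar class hypervectors
    query = classes[3] * rng.choice([1, 1, 1, -1], size=D)   # noisy copy of class 3
    mask = rng.random(D) < 0.3                     # keep ~30% of dimensions
    print(masked_associative_search(classes, query, mask))   # -> 3 (recovers the class)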
08:35 CEST TS07.2 C3CIM: CONSTANT COLUMN CURRENT MEMRISTOR-BASED COMPUTATION-IN-MEMORY MICRO-ARCHITECTURE
Speaker:
Yashvardhan Biyani, TU Delft, NL
Authors:
Yashvardhan Biyani, Rajendra Bishnoi, Said Hamdioui and Theofilos Spyrou, TU Delft, NL
Abstract
Advancements in Artificial Intelligence (AI) and Internet-of-Things (IoT) have increased demand for edge AI, but deployment on traditional AI accelerators, like GPUs and TPUs based on the von Neumann architecture, suffers from inefficiencies due to separate memory and compute units. Computation-in-Memory (CIM), utilizing non-volatile memristor devices to leverage analog computing principles and perform in-place computations, holds great potential in improving computational efficiency by eliminating frequent data movement. However, standard implementations of CIM face several challenges, primarily high power consumption and the non-linearity it induces, calling their viability for edge devices into question. In this paper, we propose C3CIM, a novel memristor-based CIM micro-architecture, featuring a new bit-cell and array design, targeting efficient implementation of Neural Networks (NN). Our architecture uses a constant current source to perform Multiply-and-Accumulate (MAC) operations with a very low computation current (10 to 100 nA), thereby significantly enhancing power efficiency. We adapted C3CIM for Spiking Neural Networks (SNN) and developed a prototype using the TSMC 40nm CMOS node for on-silicon validation. Furthermore, our micro-architecture was benchmarked using two SNN models based on the N-MNIST and IBM-Gesture datasets, for comparison against the current state-of-the-art (SOTA). Results show up to a 35x reduction in power along with a 6.7x saving in energy compared to SOTA, demonstrating the promising potential of this work for edge AI applications.
08:40 CEST TS07.3 ASNPC: AN AUTOMATED GENERATION FRAMEWORK FOR SNN AND NEUROMORPHIC PROCESSOR CO-DESIGN
Speaker:
Xiangyu Wang, National University of Defense Technology, CN
Authors:
Xiangyu Wang1, Yuan Li2, Zhijie Yang3, Chao Xiao1, Xun Xiao1, Renzhi Chen4, Weixia Xu1 and Lei Wang3
1National University of Defense Technology, CN; 2College of Computer, National University of Defense Technology, CN; 3Academy of Military Sciences, CN; 4qiyuan laboratory, CN
Abstract
Spiking neural networks (SNNs) are promisingly considered as energy-efficient alternatives to traditional deep neural networks. At the same time, neuromorphic processors have garnered increasing attention to support the efficient execution of SNNs. However, current works always separate their design to primarily prioritize a single criterion. Hardware-algorithm co-design allows for the simultaneous consideration of hardware and algorithm characteristics during the design process, effectively reducing resource usage while optimizing the algorithm's performance. In light of this, we developed a hardware-algorithm co-design framework named ASNPC for SNNs and neuromorphic processors. Considering the vast mixed-variable co-design space and the time-expensive function evaluations, we employed the surrogate-based multi-objective optimization algorithm MOTPE to identify Pareto solutions that balance algorithm performance and hardware costs. To rapidly obtain hardware results, we designed an end-to-end methodology that can automatically generate the Register-Transfer Level (RTL) code for neuromorphic processors corresponding to each candidate using templates from the hardware library. The evaluated hardware metrics, such as hardware resource and power consumption, are then fed back to MOTPE for the next candidate selection. Compared to existing works, the proposed approach exhibits the ability to find better Pareto solutions, balancing hardware costs and accuracy within a limited search budget, making it widely applicable to various application scenarios. Additionally, under the same hardware configuration, the neuromorphic processor we generated achieves lower hardware resource usage and higher throughput.
08:45 CEST TS07.4 SIMULTANEOUS DENOISING AND COMPRESSION FOR DVS WITH PARTITIONED CACHE-LIKE SPATIOTEMPORAL FILTER
Speaker:
Qinghang Zhao, Xidian University, CN
Authors:
Qinghang Zhao, Yixi Ji, Jiaqi Wang, Jinjian Wu and Guangming Shi, Xidian University, CN
Abstract
Dynamic vision sensor (DVS) is a novel neuromorphic imaging device that asynchronously generates event data corresponding to changes in light intensity at each pixel. However, the differential imaging paradigm of DVS renders it highly sensitive to background noise. Additionally, the substantial volume of event data produced in a very short time presents significant challenges for data transmission and processing. In this work, we present a novel spatiotemporal filter design, named PCLF, to achieve simultaneous denoising and compression for the first time. The PCLF employs a hierarchical memory structure that utilizes symmetric multi-bank cache-like row and column memories to store event data from a partitioned pixel array, which exhibits low memory complexity of O(m+n) for an m×n DVS. Furthermore, we propose a probability-based criterion to effectively control the compression ratio. We have implemented our design on an FPGA, demonstrating capabilities for real-time operation (≤60 ns) and low power consumption (<200 mW). Extensive experiments conducted on real-world DVS data across various tasks indicate that our design enables a reduction of event data by 30% to 68%, while maintaining or even enhancing the performance of the tasks.
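To make the spatiotemporal filtering idea concrete for readers outside the event-camera community, the Python sketch below implements a plain background-activity filter that keeps an event only if a neighbouring pixel fired recently. It is a generic baseline under assumed parameters (sensor size, time window), not PCLF itself: the partitioned cache-like row/column memories and the probability-based compression control are not modelled.

```python
# Minimal spatiotemporal background-activity filter for DVS events: an event is
# kept only if a neighbouring pixel fired within the last dt microseconds.
# Generic baseline under assumed parameters; PCLF's partitioned cache-like
# memories and probability-based compression control are not modelled here.
import numpy as np

H, W, dt = 128, 128, 5000                      # assumed sensor size and time window (us)
last_ts = np.full((H, W), -10**9, dtype=np.int64)   # last event timestamp per pixel

def filter_event(x, y, t):
    """Return True if the event at pixel (x, y), time t, has recent spatial support."""
    y0, y1 = max(0, y - 1), min(H, y + 2)
    x0, x1 = max(0, x - 1), min(W, x + 2)
    keep = bool((t - last_ts[y0:y1, x0:x1] <= dt).any())
    last_ts[y, x] = t                          # update the per-pixel memory either way
    return keep

events = [(10, 10, 100), (11, 10, 2000), (90, 40, 3000)]   # toy (x, y, t) stream
print([filter_event(*e) for e in events])      # -> [False, True, False]
```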
08:50 CEST TS07.5 PRACTICAL MU-MIMO DETECTION AND LDPC DECODING THROUGH DIGITAL ANNEALING
Speaker:
Po-Shao Chen, University of Michigan, US
Authors:
Po-Shao Chen, Wei Tang and Zhengya Zhang, University of Michigan, US
Abstract
Digital annealing has been successfully applied to solving combinatorial optimization (CO) problems. It is more flexible, robust, and easier to deploy on edge platforms compared to its counterparts including quantum annealing and analog and in-memory Ising machines. In this work, we apply digital annealing to compute-intensive communication digital signal processing problems, including multi-user detection in multiple-input and multiple-output (MU-MIMO) wireless communication systems and decoding low-density parity-check (LDPC) codes. We show that digital annealing can achieve near maximum likelihood (ML) accuracy for MIMO detection with even lower complexity than the conventional minimum mean square error (MMSE) detection. In LDPC decoding, we enhance digital annealing by introducing a new cost function that improves decoding accuracy and reduces computational complexity compared to the standard formulations.
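As background on how a detection problem maps onto an annealer, the sketch below runs a plain single-spin-flip Metropolis search over BPSK symbols to minimize the maximum-likelihood cost ||y − Hs||². The instance size, noise level, and cooling schedule are assumptions chosen for illustration; this is not the authors' digital annealing hardware or their LDPC cost function.

```python
# Generic digital-annealing-style sketch: single-spin-flip Metropolis search over
# BPSK symbols minimizing the ML detection cost ||y - H s||^2. Illustrates the
# problem mapping only; the instance, schedule and move rule are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n_tx, n_rx = 8, 8
H = rng.normal(size=(n_rx, n_tx))              # assumed channel matrix
s_true = rng.choice([-1.0, 1.0], size=n_tx)    # transmitted BPSK symbols
y = H @ s_true + 0.1 * rng.normal(size=n_rx)   # received vector with noise

def energy(s):
    r = y - H @ s
    return float(r @ r)

s = rng.choice([-1.0, 1.0], size=n_tx)         # random initial spin configuration
T = 2.0
for sweep in range(200):
    for i in range(n_tx):
        e_old = energy(s)
        s[i] = -s[i]                           # trial flip of spin i
        dE = energy(s) - e_old
        if dE > 0 and rng.random() >= np.exp(-dE / T):
            s[i] = -s[i]                       # reject the uphill move
    T *= 0.98                                  # geometric cooling schedule

print("recovered transmitted symbols:", bool(np.array_equal(s, s_true)))
```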
08:55 CEST TS07.6 LLM-SRAF: SUB-RESOLUTION ASSIST FEATURE GENERATION USING LARGE LANGUAGE MODEL
Speaker:
Tianyi Li, ShanghaiTech University, CN
Authors:
Tianyi Li1, Zhexin Tang1, Tao Wu1, Bei Yu2, Jingyi Yu1 and Hao Geng1
1ShanghaiTech University, CN; 2The Chinese University of Hong Kong, HK
Abstract
As integrated circuit (IC) feature sizes continue to shrink, using sub-resolution assist features (SRAF) becomes increasingly crucial for improving wafer pattern resolution and fidelity. However, model-based SRAF insertion techniques, while accurate, require substantial computational resources and are often impractical for industrial scenarios. This demands more efficient and industry-compatible methods that maintain high performance. In this work, we introduce LLM-SRAF, a novel framework for SRAF generation driven by a large language model fine-tuned on an SRAF dataset. LLM-SRAF accepts semantic prompt inputs, including SRAF generation task descriptions, the OPC recipe, lithography conditions, mask rules, and sequential layout descriptions, to directly generate SRAFs. Both supervised fine-tuning and reinforcement learning with human feedback (RLHF) are employed to enable the model to acquire domain-specific knowledge and specialize in SRAF generation. Experimental results show that LLM-SRAF outperforms existing state-of-the-art methods in metrics of mask quality, including edge placement error (EPE) and process variation band (PVB) area. Moreover, LLM-SRAF also runs 3x faster than the commercial Calibre tool.
09:00 CEST TS07.7 A MULTI-STAGE POTTS MACHINE BASED ON COUPLED CMOS RING OSCILLATORS
Speaker:
Yilmaz Ege Gonul, Drexel University, US
Authors:
Yilmaz Gonul and Baris Taskin, Drexel University, US
Abstract
This work presents a multi-stage coupled ring oscillator based Potts machine, designed with phase-shifted Sub-Harmonic-Injection-Locking (SHIL) to represent multivalued Potts spins at different solution stages with oscillator phases. The proposed Potts machine is able to solve a certain class of combinatorial optimization problems that natively require multivalued spins with a divide-and-conquer approach, facilitated through the alternating phase-shifted SHILs acting on the oscillators. The proposed architecture eliminates the need for any external intermediary mappings or usage of external memory, as the influence of SHIL allows oscillators to act as both memory and computation units. Planar 4-coloring problems of sizes up to 2116 nodes are mapped to the proposed architecture. Simulations demonstrate that the proposed Potts machine provides exact solutions for smaller problems (e.g. 49 nodes) and generates solutions reaching up to 97% accuracy for larger problems (e.g. 2116 nodes).
09:05 CEST TS07.8 ADAPT-PNC: MITIGATING DEVICE VARIABILITY AND SENSOR NOISE IN PRINTED NEUROMORPHIC CIRCUITS WITH SO ADAPTIVE LEARNABLE FILTERS
Speaker:
Tara Gheshlaghi, KIT - Karlsruher Institut für Technologie, DE
Authors:
Tara Gheshlaghi1, Priyanjana Pal1, Haibin Zhao1, Michael Hefenbrock2, Michael Beigl1 and Mehdi Tahoori1
1Karlsruhe Institute of Technology, DE; 2RevoAI GmbH, DE
Abstract
The rise of the Internet of Things demands flexible, biocompatible, and cost-effective devices. Printed electronics provide a solution through low-cost and on-demand additive manufacturing on flexible substrates, making them ideal for IoT applications. However, variations in additive manufacturing processes pose challenges for reliable circuit fabrication. Adapting neuromorphic computing to printed electronics could address these issues. Printed neuromorphic circuits offer robust computational capabilities for near-sensor processing in IoT. One limitation of existing printed neuromorphic circuits is their inability to process temporal sensory inputs. To address this, integrating temporal components into printed neuromorphic circuit architectures enables the effective processing of time-series sensory data. Printed neuromorphic circuits face challenges from manufacturing variations such as ink dispersion, as well as sensor noise and temporal fluctuations, especially when processing temporal data and using time-dependent components like capacitors. To mitigate these challenges, we propose robustness-aware temporal processing neuromorphic circuits with low-pass second-order learnable filters (SO-LF). This approach integrates variation awareness by considering the variation potential of component values during training and uses data augmentation to enhance adaptability against physical and sensor data variations. Simulations on 15 benchmark time-series datasets show that our circuit effectively handles noisy temporal information under 10% process variations, achieving an average accuracy and power improvement of ≈ 24.7% and ≈ 91%, respectively, compared to models lacking variation awareness with ≈ 1.9× more devices.
09:10 CEST TS07.9 SELF-ADAPTIVE ISING MACHINES FOR CONSTRAINED OPTIMIZATION
Speaker and Author:
Corentin Delacour, University of California, Santa Barbara, US
Abstract
Ising machines (IMs) are physics-inspired alternatives to von Neumann architectures for solving hard optimization tasks. By mapping binary variables to coupled Ising spins, IMs can naturally solve unconstrained combinatorial optimization problems such as finding maximum cuts in graphs. However, despite their importance in practical applications, constrained problems remain challenging to solve for IMs that require large quadratic energy penalties to ensure the correspondence between energy ground states and constrained optimal solutions. To relax this requirement, we propose a self-adaptive IM that iteratively shapes its energy landscape using a Lagrange relaxation of constraints and avoids prior tuning of penalties. Using a probabilistic-bit (p-bit) IM emulated in software, we benchmark our algorithm with multidimensional knapsack problems (MKP) and quadratic knapsack problems (QKP), the latter being an Ising problem with linear constraints. For QKP with 300 variables, the proposed algorithm finds better solutions than state-of-the-art IMs such as Fujitsu's Digital Annealer and requires 7,500x fewer samples. Our results show that adapting the energy landscape during the search can speed up IMs for constrained optimization.
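For readers unfamiliar with Lagrange relaxation in this setting, the sketch below illustrates the general idea of adapting a multiplier from the observed constraint violation instead of fixing a large quadratic penalty up front. It is a generic subgradient illustration in Python, not the paper's p-bit algorithm: the knapsack instance, step-size schedule, and greedy inner solver are all assumptions. Because of the duality gap, the best feasible solution tracked this way is a heuristic, not necessarily the optimum.

```python
# Generic Lagrange-relaxation sketch for a knapsack-style constraint: instead of
# fixing a large quadratic penalty, a multiplier is adapted from the observed
# constraint violation (subgradient updates). Illustration only; the instance,
# step-size schedule, and greedy inner solver are assumptions.
import numpy as np

values = np.array([10.0, 7.0, 5.0, 12.0, 3.0])
weights = np.array([4.0, 3.0, 2.0, 6.0, 1.0])
capacity = 8.0

lam = 0.0                                      # Lagrange multiplier for the constraint
best_x, best_val = None, -np.inf

for step in range(200):
    # Inner step: with lam fixed, the relaxed objective separates per item,
    # so each item is taken iff its adjusted profit is positive.
    x = (values - lam * weights > 0).astype(float)
    used = float(weights @ x)
    if used <= capacity and values @ x > best_val:
        best_x, best_val = x.copy(), float(values @ x)   # best feasible found so far
    # Subgradient update with a diminishing step: raise lam on violation,
    # lower it (but keep it non-negative) when there is slack.
    lam = max(0.0, lam + (1.0 / (step + 1)) * (used - capacity))

print("best feasible selection:", best_x, "value:", best_val)
```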
09:15 CEST TS07.10 ENABLING SNN-BASED NEAR-MEA NEURAL DECODING WITH CHANNEL SELECTION: AN OPEN-HW APPROACH
Speaker:
Gianluca Leone, Università degli Studi di Cagliari, IT
Authors:
Gianluca Leone, Luca Martis, Luigi Raffo and Paolo Meloni, Università degli Studi di Cagliari, IT
Abstract
Advancements in CMOS microelectrode array sensors have significantly improved sensing area and resolution, paving the way to accurate Brain-Machine Interfaces (BMIs). However, near-sensor neural decoding on implantable computing devices is still an open problem. A promising solution is provided by Spiking Neural Networks (SNNs), which leverage event sparsity to improve energy consumption. However, given the typical data rates involved, the workload related to I/O acquisition and spike encoding is dominant and limits the benefits achievable with event-based processing. In this work, we present two power-efficient implementations, on FPGA and ASIC, of a dedicated processor for the decoding of intracortical action potentials from primary motor cortex. The processor leverages lightweight sparse SNNs to achieve state-of-the-art accuracy. To limit the impact of I/O transfers on energy efficiency, we introduced a channel selection scheme that reduced bandwidth requirements by 3x and power consumption by 2.3x and 1.6x on the FPGA and ASIC, respectively, enabling inference at 0.446 µJ and 1.04 µJ, with no significant loss in accuracy. To promote broad adoption in a specialized, research-intensive domain, we have based our implementations on open-source EDA tools, low-cost hardware, and an open PDK.
09:20 CEST TS07.11 TOWARDS FAST AUTOMATIC DESIGN OF SILICON DANGLING BOND LOGIC
Speaker:
Jan Drewniok, TU Munich, DE
Authors:
Jan Drewniok1, Marcel Walter1, Samuel Ng2, Konrad Walus2 and Robert Wille1
1TU Munich, DE; 2University of British Columbia, CA
Abstract
In recent years, Silicon Dangling Bond (SiDB) logic has emerged as a promising beyond-CMOS technology. Unlike conventional circuit technology, where logic is realized through transistors, SiDB logic utilizes quantum dots with variable charge states. By strategically arranging these dots, logic functions can be constructed. However, determining such arrangements is a tremendously complex task, which makes automatically obtaining SiDB logic implementations inefficient. To address this challenge, we propose an idea to speed up the design process by utilizing dedicated search space pruning strategies. Initial results show that the combined pruning techniques yield 1) a drastic reduction of the search space, and 2) a corresponding reduction in runtime by up to a factor of 33.
09:21 CEST TS07.12 LOADING-AWARE MIXING-EFFICIENT SAMPLE PREPARATION ON PROGRAMMABLE MICROFLUIDIC DEVICE
Speaker:
Debraj Kundu, TU Munich, DE
Authors:
Debraj Kundu1, Tsun-Ming Tseng2, Shigeru Yamashita3 and Ulf Schlichtmann2
1TU Munich (TUM), DE; 2TU Munich, DE; 3Ritsumeikan University, JP
Abstract
Sample preparation, where a certain number of reagents must be mixed in a specific volumetric ratio, is an integral step for various bio-assays. A programmable microfluidic device (PMD) is an advanced flow-based microfluidic biochip (FMB) platform that is considered to be very effective for sample preparation. However, the impact of mixer placement, reagent distribution, and mixing time on the automation of sample preparation has not yet been investigated. We consider a mixing efficiency model controlled by the number of alternations "μ" of reagents along the mixing circulation path and propose a loading-aware placement strategy that maximizes the mixing efficiency. We use satisfiability modulo theories (SMT) and propose a one-pass strategy for placing the mixers and the reagents that successfully enhances the loading and mixing efficiencies.

W02 Heterogeneous Integration: from advanced 3D technology to innovative computing architectures

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 08:30 CEST - 12:30 CEST


ASD05 ASD focus session: Teleoperation as a Step Towards Fully Autonomous Systems

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Organisers:
Frank Diermeyer, TU Munich, DE
Rolf Ernst, TU Braunschweig, DE

In the foreseeable future, highly automated mobile systems, such as vehicles, robots, UAVs, or trains, will be confronted with difficult situations that require external support. The availability of such external support corresponds to level 4 driving automation and is an essential feature in current robotaxis and automated public transportation. While the first generation of level 4 prototypes relied on safety driver support, commercial systems are gradually moving towards support by teleoperation. Designing teleoperation support for level 4 systems is an end-to-end problem involving two main research and practical challenges: the teleoperation function, which defines the remote human interface with its scene representation and available control functions, and the real-time communication channel, which involves wired and wireless segments and must provide reliable end-to-end data transport.

Time Label Presentation Title
Authors
11:00 CEST ASD05.1 AUTOMATED VEHICLE TELEOPERATION – VISION AND CHALLENGES.
Presenter:
Frank Diermeyer, TU Munich, DE
Author:
Frank Diermeyer, TU Munich, DE
Abstract
.
11:20 CEST ASD05.3 RELIABLE REAL-TIME COMMUNICATION FOR TELEOPERATION
Presenter:
Selma Saidi, Technische Universität Braunschweig, DE
Author:
Selma Saidi, Technische Universität Braunschweig, DE
Abstract
.
11:30 CEST ASD05.4 PANEL DISCUSSION
Presenter:
All the Panelists, DATE 2025, FR
Author:
All the Panelists, DATE 2025, FR
Abstract
.

FS06 Focus Session: Improving Chip Design Enablement for Universities in Europe

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Session chair:
Ulf Schlichtmann, TU Munich, DE

Session co-chair:
Holger Blume, Leibniz University Hannover, DE

Organisers:
Norbert Wehn, University of Kaiserslautern-Landau, DE
Lukas Krupp, University of Kaiserslautern-Landau, DE

Time Label Presentation Title
Authors
11:00 CEST FS06.1 PANEL: IMPROVING CHIP DESIGN ENABLEMENT FOR UNIVERSITIES IN EUROPE
Speaker:
Norbert Wehn, RPTU University of Kaiserslautern-Landau, DE
Authors:
X. Sharon Hu1, Joachim Rodrigues2, Luca Benini3, Ian O'Connor4, Andreas Brüning5 and Patrick Haspel6
1University of Notre Dame, US; 2Lund University, SE; 3ETH Zurich, CH | Università di Bologna, IT; 4Lyon Institute of Nanotechnology, FR; 5FMD, DE; 6Synopsys, DE
Abstract
The semiconductor industry is central to the European economy, particularly in the industrial and automotive sectors. Semiconductor fabrication and chip design are the two largest segments of the microelectronics value chain. While Europe is strengthening semiconductor fabrication and technology with considerable investments, e.g., in new fabs, chip design capabilities fall far short of the required capacities. The EU MicroElectronics Training, Industry and Skills (METIS) Report 2023 has shown that chip designers are the job profiles identified as the most difficult to find in the European microelectronics industry. European universities face many challenges hindering their ability to produce skilled graduates and contribute to the semiconductor ecosystem. While student interest in, e.g., AI is booming, we observe a decreasing interest in microelectronics. The main reasons for this are the high entry barriers for students, reinforced by the lack of chip design enablement in academia. Hence, there are ongoing initiatives in different European countries, on the EU level, and worldwide to strengthen chip design education and research. This focus session will bring together stakeholders of these initiatives from Europe and the USA to explore the critical challenges, opportunities, and potential strategies facing chip design enablement in European academic institutions. The session will be held in the panel format with active audience participation to guarantee inclusiveness and foster a broad view of the topic.

MPP01 Driving KDT JU Initiative towards the Chips Act Multi-Partner Projects

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST MPP01.1 MULTI-PARTNER PROJECT: A MODEL-DRIVEN ENGINEERING FRAMEWORK FOR FEDERATED DIGITAL TWINS OF INDUSTRIAL SYSTEMS (MATISSE)
Speaker:
Djamel Eddine Khelladi, CNRS, University of Rennes, FR
Authors:
Alessio Bucaioni1, Romina Eramo2, Luca Berardinelli3, Hugo Bruneliere4, Benoit Combemale5, Djamel Khelladi5, Vittoriano Muttillo2, Andrey Sadovykh6 and Manuel Wimmer3
1Mälardalen University, SE; 2University of Teramo, IT; 3JKU, AT; 4IMT Atlantique, FR; 5IRISA, FR; 6Softeam, FR
Abstract
Digital twins are virtual representations of real-world entities or systems. Their primary goal is to help organizations understand and predict the behaviour and properties of these entities or systems. Additionally, digital twins enhance activities such as monitoring, verification, validation, and testing. However, the inherent complexity of digital twins implies challenges throughout the systems engineering process. This notably includes design, development, and analysis phases, as well as deployment, execution, and maintenance. Moreover, existing approaches, methods, techniques, and tools for modelling, simulating, validating, and monitoring single digital twins must now address the increased complexity in federation scenarios. These scenarios introduce new challenges, such as digital twin identification, shared metadata, cross-digital twin communication and synchronization, and federation governance. The KDT Joint Undertaking MATISSE project tackles these challenges by aiming to provide a model-driven framework for the continuous engineering of federated digital twins. It leverages model-driven engineering techniques and practices as the core enabling technology, with traceability serving as an essential infrastructural service for the digital twins federation. In this paper, we introduce the MATISSE conceptual framework for digital twins, highlighting both the novelty of the project's concept and its technical objectives. As the project is still in its initial phase, we identify key research challenges relevant to the DATE community and propose a preliminary research roadmap. This roadmap addresses traceability and federation mechanisms, the required continuous engineering strategy, and the development of digital twin-based services for verification, validation, prediction, and monitoring. To illustrate our approach, we present two concrete scenarios that demonstrate practical applications of the MATISSE conceptual framework.
11:05 CEST MPP01.2 MULTI-PARTNER PROJECT: ELECTRIC VEHICLE DATA ACQUISITION AND VALORISATION: A PERSPECTIVE FROM THE OPEVA PROJECT
Speaker:
Gianluigi Ferrari, University of Parma, IT
Authors:
Alper Kanak1, Salih Ergün2, İbrahim Arif3, Ali Serdar Atalay4, Serhat Ege İnanç4, Oguzhan Herkiloğlu5, Ahmet Yazıcı6, Yunus Sabri Kirca6, Muhammed Ozberk7, Kerem Sarı7, Ali Kafalı7, Dilara Bayar7, Muhammed Oğuz Taş8, Luca Davoli9, Laura Belli9, Gianluigi Ferrari9, Badar Muneer10, Valentina Palazzi9, Luca Roselli10 and Fabio Gelati11
1Ergünler R&D Co.Ltd., TR; 2Ergünler R&D Co. Ltd., TR; 3Ergtech SP.Z.O.O., PL; 4AI4SEC OÖ, EE; 5Bitnet Bilişim Hizmetleri Ltd., TR; 6Eskişehir Osmangazi University, TR; 7ACD Data Engineering, TR; 8INO Robotics, TR; 9University of Parma, IT; 10University of Perugia, IT; 11Luna Geber Engineering s.r.l., IT
Abstract
The OPtimization of Electric Vehicle Autonomy (OPEVA) project enhances data aggregation for Electric Vehicles (EVs) by collecting critical real-time data (i.e., vehicle performance, battery health, charging behaviours) through heterogeneous data acquisition devices built on robust HW and integrated with Internet of Things (IoT) protocols. By combining internal sensor data and driver-specific behaviours with external information (e.g., road conditions, charging station availability), OPEVA maximizes vehicle performance, establishing secure and seamless data communication between EVs and the infrastructure, and using IoT and cloud computing tools alongside Vehicle-to-Everything (V2X) devices and networks. This paper focuses on the extensible data model ensuring semantic data integrity considering in- and out-vehicle factors, presenting data acquisition solutions that deal with OPEVA's semantic data model and their use in various Artificial Intelligence (AI)-powered use cases (e.g., range prediction, route optimization, battery management).
11:10 CEST MPP01.3 MULTI-PARTNER PROJECT: A DEEP LEARNING PLATFORM TARGETING EMBEDDED HARDWARE FOR EDGE-AI APPLICATIONS (NEUROKIT2E)
Speaker:
Rajendra Bishnoi, TU Delft, NL
Authors:
Rajendra Bishnoi1, Mohammad Yaldagard1, Kanishkan Vadivel2, Manolis Sifalakis2, Nicolas Rodriguez3, Pedro Julian4, Lothar Ratschbacher3, Maen Malla5, Yogesh Pati5, Rashid Ali5 and Fabian Chersi6
1TU Delft, NL; 2IMEC Netherlands, NL; 3Silicon Austria Labs, AT; 4Universidad Nacional del Sur IIIE-DIEC, AR; 5Fraunhofer IIS, DE; 6CEA, FR
Abstract
The goal of the NEUROKIT2E project (EU HORIZON-JU-RIA) is to create an open-source Deep Learning framework for edge and embedded AI, built around an established European value chain. This framework supports a wide range of application areas that operate independently and serve a global user community. It provides easy and fast full-stack solutions, from AI application development to Neural Network design and optimization, all the way down to hardware implementations, while enabling code generation for application-specific targets. This platform provides flexibility for academic users in the AI domain to explore and innovate while allowing them the possibility to prototype systems, ensuring their work aligns well with industrial needs. This paper presents the results and achievements of the first part of this three-year project, along with its roadmap and expected outcomes.
11:15 CEST MPP01.4 MULTI-PARTNER PROJECT: SPORTS PERFORMANCE AND HEALTH ASSESSMENT IN THE DISTRIMUSE PROJECT
Speaker:
Gianluigi Ferrari, University of Parma, IT
Authors:
Luca Davoli1, Laura Belli1, Veronica Mattioli1, Gianluigi Ferrari1, Lorenzo Priano2, Jaromir Hubalek3, Lukáš Smital3, Andrea Němcová3, Daniela Chlíbková3, Vlastimil Benes4 and Johan Plomp5
1University of Parma, IT; 2University of Turin, IT; 3Brno University of Technology, CZ; 4IMA s.r.o., CZ; 5VTT Oy, FI
Abstract
In our increasingly tech-saturated world, from mobile apps and health sensors to autonomous cars and factory robots, we expect these devices to seamlessly integrate into our lives, enhancing safety and convenience. However, as these devices proliferate and their autonomy grows, ensuring they provide unobtrusive, yet effective support becomes crucial. The Horizon Europe KST multi-partner project "Distributed Multi-Sensor Systems for Human Safety and Health" (DistriMuSe) intends to support human health and safety through improved sensing of human presence, behaviour, and vital signs in a collaborative or common environment by means of multi-sensor systems, distributed processing and Machine/Deep Learning (ML/DL) techniques. In this paper, we focus on DistriMuSe's approach to sports performance and health assessment, in particular monitoring the physical activity of non-professional and hobby athletes, people who like sports and care about their health, elderly healthy people, and subjects affected by neurological disability (e.g., Parkinson's disease). The overall goal is to measure activity and exertion, estimating performance levels and determining maximum effort. We discuss the overall system-of-systems architecture, focusing on the adopted technologies.
11:20 CEST MPP01.5 MULTI-PARTNER PROJECT: ADVANCING THE EDA TOOLS LANDSCAPE FOR THE EUROPEAN RISC-V ECOSYSTEM IN TRISTAN
Speaker:
Bernhard Fischer, Siemens, AT
Authors:
Fatma Jebali1, Caaliph Andriamisaina2, Mathieu Jan2, Wolfgang Ecker3, Florian Egert4, Bernhard Fischer4, Alessio Burrello5, Daniele Jahier Pagliari5, Sara Vinco5, Giuseppe Tagliavini6, Ingo Feldner7, Andreas Mauderer7, Axel Sauer7, Arnór Kristmundsson8, Alexander Schober8, Téo Bernier9, Matti Käyrä10, Ulf Schlichtmann11 and Rocco Jonack12
1CEA LIST, FR; 2CEA-List, FR; 3Infineon Technologies, DE; 4Siemens, AT; 5Politecnico di Torino, IT; 6Università di Bologna, IT; 7Robert Bosch GmbH, DE; 8Codasip, DE; 9Thales Research & Technology, FR; 10Tampere University, FI; 11TU Munich, DE; 12MINRES Technologies GmbH, DE
Abstract
The TRISTAN project aims to expand and industrialize the European RISC-V ecosystem to compete effectively with existing commercial alternatives. This initiative specifically targets the critical challenges in the development of Electronic Design Automation (EDA) tools, essential for RISC-V-based solutions, by leveraging the synergy between the open source community and industrial solutions. This paper presents an overview of the current landscape of TRISTAN's EDA flow, highlighting specific tools and methodologies that streamline the early design phases of RISC-V-based systems. We explore the unique features of these tools, emphasizing how they complement each other to strengthen the overall design process.
11:25 CEST MPP01.6 MULTI-PARTNER PROJECT: ENABLING DIGITAL TECHNOLOGIES FOR HOLISTIC HEALTH-LIFESTYLE MOTIVATIONAL AND ASSISTED SUPERVISION SUPPORTED BY ARTIFICIAL INTELLIGENCE (H2TRAIN)
Speaker:
Juan Antonio Montiel Nelson, Institute for Applied Microelectronics, University of Las Palmas de Gran Canaria, Las Palmas de G.C., ES
Authors:
Juan Antonio Nelson1, Marco Ottella2 and Paolo Azzoni3
1Universidad de Las Palmas, ES; 2Xtremion, IT; 3INSIDE Industry Association, NL
Abstract
H2TRAIN aligns with the ECS Strategic Research and Innovation Agenda 2023 (ECS-SRIA), addressing key challenges in integrating digital technologies for health-focused lifestyles through AI-enhanced networks. This project pioneers the use of graphene to develop autonomous biosensors within CMOS technology, supporting advancements in AI-powered health services and IoT applications, covering the entire edge-to-cloud continuum. Beyond digital integration, H2TRAIN innovates in energy detection, collection, and storage, essential for embedding health and sports functions in IoT wearables through smart textile and system integration. The solutions will be rigorously tested and validated with insights from medical, sports, social sciences, and end-user feedback. Focused on remote assisted living, amateur sports training, and post-operative monitoring, H2TRAIN aims to drive innovation in the smart healthcare sector, where investment in semiconductor nanofabrication is limited by the small scale of medical applications.
11:30 CEST MPP01.7 MULTI-PARTNER PROJECT: DRIVING THE VEHICLE OF THE FUTURE: HOW FEDERATE AND HAL4SDV ARE SHAPING EUROPE'S SOFTWARE-DEFINED VEHICLE ECOSYSTEM
Speaker:
Michael Paulweber, AVL, AT
Authors:
Michael Paulweber1, Andreas Eckel2 and Paolo Azzoni3
1AVL-Instrumentation and test systems, AT; 2TTTech Computertechnik AG, DE; 3Inside Industry Association, NL
Abstract
The FEDERATE and HAL4SDV projects aim to address the growing importance of software in the automotive industry, positioning Europe as a leader in the software-defined vehicle (SDV) domain. FEDERATE focuses on building a cohesive European SDV ecosystem by coordinating stakeholders such as OEMs, semiconductor companies, and research institutions. It supports the agile development of non-differentiating software through open-source collaboration, fostering a vibrant SDV community and providing guidance for ongoing and future SDV projects. Meanwhile, HAL4SDV aligns with the EU's Strategic Research and Innovation Agenda to develop technologies and processes needed for SDV advancement beyond 2030. HAL4SDV's objectives include creating a unified software interface, hardware abstraction, and Over-The-Air updates, while focusing on cybersecurity, real-time capabilities, and seamless integration with smart city infrastructure. Together, these projects aim to drive innovation, scalability, and sustainability in the SDV space.
11:31 CEST MPP01.8 MULTI-PARTNER PROJECT: ARTIFICIAL INTELLIGENCE IN MANUFACTURING LEADING TO SUSTAINABILITY AND THE CONSIDERATION OF HUMAN ASPECTS (AIMS5.0)
Speaker:
Anouar Nechi, University of Lübeck, DE
Authors:
Anouar Nechi1, Yasin Ghafourian2, Belal Abu Naim2, Thomas Gutt3, Georgios Dimitrakopoulos4, Amira Moualhi1, Mladen Berekovic1, Pal Varga5 and Markus Tauber2
1University of Lübeck, DE; 2Research Studios Austria, AT; 3Infineon Technologies, DE; 4Harokopio University of Athens, GR; 5Budapest University of Technology and Economics, HU
Abstract
The industrial landscape is undergoing a transformative shift towards Industry 5.0, a paradigm characterized by the convergence of sustainability, digital autonomy, and human-centric design. This article focuses on the adoption, enhancement, and implementation of AI-driven hardware, tools, methodologies, and semiconductor technologies in this progression. We present here a comprehensive strategy from the AIMS5.0 project with the objective of connecting academic developments with practical industrial use, fostering a harmonious relationship between humans and machines to improve efficiency, spur innovation, and enhance adaptability. Hence we show here our global vision, and examples of how the creation of AI-based industrial solutions is supported by novel AI-tool chains, advancements in hardware, and tools supporting human aspects.

SD02 Special Day on AI and ML Trends

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

This Special Day focuses on exploring the latest trends and innovations in Artificial Intelligence (AI) and Machine Learning (ML) in the context of DATE. As AI (and mainly generative AI) is booming, especially since the release of ChatGPT, we expect AI/ML to change the way we approach Design, Automation, and Test. In this context, field experts will present their thoughts on the challenges and opportunities of AI/ML, and will engage the audience in an open discussion about the trends that the DATE community should pursue.

This Special Day will highlight the following topics:
* Design of hardware architectures and software, including automatic exploration of large design spaces, assistance of the human designer, resource selection and optimization
* Verification of hardware architectures, with topics such as performance prediction, (formal) design validation, accelerating simulations thanks to AI-Augmented Surrogate Models
* AI-Accelerated Physical Design and Validation of layout and floorplans
* New AI accelerator architectures

These topics will be addressed by a lineup of six distinguished speakers, experts in their respective fields. The day will conclude with a panel discussion allowing experts and the audience to engage in an informal exchange of ideas and trigger discussions on the future research directions and/or the interaction between the various domains presented during the day.

This Special Day is the ideal event for AI/ML researchers, data scientists, hardware designers, software developers, sustainability advocates, and anyone interested in the future directions of AI and ML for Design, Automation and Test.

Time Label Presentation Title
Authors
11:00 CEST SD02.1 TBD
Presenter:
Siddharth Garg, New York University, US
Author:
Siddharth Garg, New York University, US
Abstract
.
11:25 CEST SD02.2 TBD
Presenter:
Orlando Moreira, Snapchat, US
Author:
Orlando Moreira, Snapchat, US
Abstract
.
11:50 CEST SD02.3 TBD
Presenter:
Wolfgang Ecker, Infineon Technologies, DE
Author:
Wolfgang Ecker, Infineon Technologies, DE
Abstract
.
12:15 CEST SD02.4 ROUND TABLE WITH THE SPEAKERS, DISCUSSION WITH THE AUDIENCE
Presenter:
All the Panelists, DATE, FR
Author:
All the Panelists, DATE, FR
Abstract
.

TS08 Design Methodologies and Applications for Machine Learning

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS08.1 FILTER-BASED ADAPTIVE MODEL PRUNING FOR EFFICIENT INCREMENTAL LEARNING ON EDGE DEVICES
Speaker:
Jing-Jia Hung, National Taiwan University & TSMC, TW
Authors:
Jing-Jia Hung1, Yi-Jung Chen2, Hsiang-Yun Cheng3, Hsu Kao4 and Chia-Lin Yang1
1National Taiwan University, TW; 2Department of Computer Science and Information Engineering, National Chi Nan University, TW; 3Academia Sinica, TW | National Taiwan University, TW; 4National Tsing Hua University, TW
Abstract
Incremental Learning (IL) enhances Machine Learning (ML) models over time with new data, ideal for edge devices at the forefront of data collection. However, executing IL on edge devices faces challenges due to limited resources. Common methods involve IL followed by model pruning, or specialized IL methods for edges. However, the former increases training time due to fine-tuning and compromises accuracy for past classes due to limited retained samples or features. Meanwhile, existing edge-specific IL methods utilize weight pruning, which requires specialized hardware or compilers for speedup and cannot reduce computations on general embedded platforms. In this paper, we propose Filter-based Adaptive Model Pruning (FAMP), the first pruning method designed specifically for IL. FAMP prunes the model before the IL process, allowing fine-tuning to occur concurrently with IL, thereby avoiding extended training time. To maintain high accuracy for both new and past data classes, FAMP adapts the compressed model based on observed data classes and retains filter settings from the previous IL iteration to mitigate forgetting. Across all tests, FAMP achieves the best average accuracy, with only a 2.78% accuracy drop relative to full ML models with IL. Moreover, unlike the common methods that prolong training time, FAMP requires 35% less training time on average than using the full ML models for IL.
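For context, the snippet below shows the simplest form of filter-level pruning that the abstract builds on: scoring each convolutional output filter by its L1 norm and keeping the strongest ones. The keep ratio and scoring criterion are illustrative assumptions; FAMP's adaptive, IL-aware pruning and filter retention are not reproduced here.

```python
# Sketch of L1-norm filter pruning for a convolutional layer, as a generic
# illustration of "pruning before incremental learning". The keep ratio and
# criterion are assumptions; FAMP's adaptive, class-aware pruning is not shown.
import numpy as np

rng = np.random.default_rng(0)
conv_w = rng.normal(size=(64, 32, 3, 3))        # (out_filters, in_channels, k, k)

def prune_filters(w, keep_ratio=0.5):
    scores = np.abs(w).sum(axis=(1, 2, 3))      # L1 norm of each output filter
    n_keep = int(len(scores) * keep_ratio)
    keep = np.sort(np.argsort(scores)[-n_keep:])  # indices of the strongest filters
    return w[keep], keep

pruned_w, kept_idx = prune_filters(conv_w)
print(conv_w.shape, "->", pruned_w.shape)       # (64, 32, 3, 3) -> (32, 32, 3, 3)
```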
11:05 CEST TS08.2 DYLGNN: EFFICIENT LM-GNN FINE-TUNING WITH DYNAMIC NODE PARTITIONING, LOW-DEGREE SPARSITY, AND ASYNCHRONOUS SUB-BATCH
Speaker:
Zhen Yu, Shanghai Jiao Tong University, CN
Authors:
Zhen Yu, Jinhao Li, Jiaming Xu, Shan Huang, Jiancai Ye, Ningyi Xu and Guohao Dai, Shanghai Jiao Tong University, CN
Abstract
Text-Attributed Graphs (TAGs) tasks involve both textual node information and graph topological structure. The top-k method, using Language Models (LMs) for text encoding and Graph Neural Networks (GNNs) for graph processing, offers the best accuracy while balancing memory and training time. However, challenges still exist: (1) Static sampling of k neighbors reduces performance. Using a fixed k can result in sampling too few or too many nodes, leading to a 3.2% accuracy loss across datasets. (2) Time-consuming processing for non-trainable nodes. After partitioning all nodes into with-gradient trainable and without-gradient non-trainable sets, the number of non-trainable nodes is ∼9-10× larger than the number of trainable nodes, accounting for nearly 70% of the total time. (3) Time-consuming data movement. For processing non-trainable nodes, after the text strings are tokenized into tokens on the CPU side, the data movement from host memory to GPU takes 30%-40% of the time. In this paper, we propose DyLGNN, an efficient end-to-end LM-GNN fine-tuning framework built on three innovations: (1) Heuristic Node Partitioning. We propose an algorithm that dynamically and adaptively selects "important" nodes to participate in the training process for downstream tasks. Compared to the static top-k method, we reduce the training memory usage by 24.0%. (2) Low-Degree Sparse Attention. We point out that the embedding of low-degree nodes has minimal impact on the final results (e.g., ∼1.5% accuracy loss); therefore, we perform sparse attention computation on low-degree nodes to further reduce the computation caused by "unimportant" nodes, achieving an average 1.27× speedup. (3) Asynchronous Sub-batch Pipeline. Within the top-k framework, we analyze the time breakdown of the LM inference component. Leveraging our heuristic node partitioning, which effectively minimizes memory demands, we can asynchronously execute data movement and computation, thereby overlapping the time required for data movement. This improves GPU utilization and results in an average 1.1× speedup. We conduct experiments on several common graph datasets, and by combining the three methods mentioned above, DyLGNN achieves a 22.0% reduction in memory usage and a 1.3× end-to-end speedup compared to the top-k strategy.
11:10 CEST TS08.3 ITERL2NORM: FAST ITERATIVE L2-NORMALIZATION
Speaker:
ChangMin Ye, Hanyang University, KR
Authors:
ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin and Doo Seok Jeong, Hanyang University, KR
Abstract
Transformer-based large language models are memory-bound models whose operation is based on a large amount of data that is only marginally reused. Thus, the data movement between a host and accelerator likely dictates the total wall-clock time. Layer normalization is one of the key workloads in the transformer model, following each of the multi-head attention and feed-forward network blocks. To reduce data movement, layer normalization needs to be performed on the same chip as the matrix-matrix multiplication engine. To this end, we introduce an iterative L2-normalization method for 1D input (IterL2Norm), ensuring fast convergence to the steady-state solution within five iteration steps and high precision, outperforming the fast inverse square root algorithm in six out of nine cases for FP32 and five out of nine for BFloat16 across the embedding lengths used in the OPT models. Implemented in 32/28nm CMOS, the IterL2Norm macro normalizes d-dimensional vectors, where 64 ≤ d ≤ 1024, with a latency of 116-227 cycles at 100MHz/1.05V.
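To illustrate what an iterative L2-normalization can look like numerically, the sketch below refines an inverse-square-root estimate with a fixed number of Newton-Raphson steps and then scales the input vector. This is a generic textbook iteration under an assumed exponent-based initial guess, not the IterL2Norm datapath described in the paper.

```python
# Generic sketch: L2-normalization via an iterative inverse square root refined
# with Newton-Raphson steps. Not the paper's IterL2Norm datapath; the
# exponent-based initial guess and the fixed iteration count are assumptions.
import math
import numpy as np

def iter_l2_normalize(x, iters=5):
    s = float(np.dot(x, x))              # sum of squares of the input vector
    _, e = math.frexp(s)                 # s = m * 2**e with m in [0.5, 1)
    y = 2.0 ** (-e / 2.0)                # cheap initial guess, within ~2x of 1/sqrt(s)
    for _ in range(iters):
        y = y * (1.5 - 0.5 * s * y * y)  # Newton-Raphson step converging to 1/sqrt(s)
    return x * y                         # approximately x / ||x||_2

v = np.array([3.0, 4.0])
print(iter_l2_normalize(v))              # ~ [0.6, 0.8]
```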
11:15 CEST TS08.4 MPTORCH-FPGA: A CUSTOM MIXED PRECISION FRAMEWORK FOR FPGA-BASED DNN TRAINING
Speaker:
Sami BEN ALI, Inria Rennes, FR
Authors:
Sami BEN ALI1, Silviu-Ioan Filip1, Olivier Sentieys1 and Guy Lemieux2
1INRIA, FR; 2University of British Columbia, CA
Abstract
Training Deep Neural Networks (DNNs) is computationally demanding, leading to a growing interest in reduced precision formats to enhance hardware efficiency. Several frameworks explore custom number formats with parameterizable precision through software emulation on CPUs or GPUs. However, they lack comprehensive support for different rounding modes and struggle to accurately evaluate the impact of custom precision for FPGA-based targets. This paper introduces MPTorch-FPGA, an extension of the MPTorch framework for performing custom, multi-precision inference and training computations in CPU, GPU, and FPGA environments in PyTorch. MPTorch-FPGA can generate a model-specific accelerator for DNN training, with customizable sizes and arithmetic implementations, providing bit-level accuracy with respect to emulated low precision DNN training on GPUs or CPUs. An offline matching algorithm selects one of several pre-generated (static) FPGA configurations using a custom performance model to estimate latency. To showcase the versatility of MPTorch-FPGA, we present a series of training benchmarks using diverse DNN models, exploring a range of number format configurations and rounding modes. We report both accuracy and hardware performance metrics, verifying the precision of our performance model by comparing estimated and measured latencies across multiple benchmarks. These results highlight the flexibility and practical value of our framework.
11:20 CEST TS08.5 MEMHD: MEMORY-EFFICIENT MULTI-CENTROID HYPERDIMENSIONAL COMPUTING FOR FULLY-UTILIZED IN-MEMORY COMPUTING ARCHITECTURES
Speaker:
Do Yeong Kang, Sungkyunkwan University, KR
Authors:
Do Yeong Kang, Yeong Hwan Oh, Chanwook Hwang, Jinhee Kim, Kang Eun Jeon and Jong Hwan Ko, Sungkyunkwan University, KR
Abstract
Hyperdimensional Computing (HDC) has shown great potential in brain-inspired computing, but its integration with In-Memory Computing (IMC) faces challenges due to high-dimensional vector operations and memory utilization issues. This paper introduces a novel multi-centroid Associative Memory (AM) structure for HDC implemented on IMC architectures, addressing these challenges while maintaining high accuracy in classification tasks. Our approach compresses dimensions through the multi-centroid model, bringing IMC array utilization for Associative Search close to 100% and significantly reducing computations. This dimension compression substantially decreases memory footprint in both the Encoding Module and Associative Memory, while reducing computational requirements. Additionally, we propose innovative initialization and learning methods for multi-centroid AM, including clustering-based initialization for faster convergence and a quantization-aware iterative learning approach for high-accuracy, IMC-compatible AM training. Our adaptive structure optimizes model design based on available hardware resources by adjusting memory columns and rows. Comprehensive evaluations across various classification datasets demonstrate that our method achieves superior memory efficiency at equivalent accuracy levels and improved accuracy at equivalent memory usage compared to conventional HDC models.
11:25 CEST TS08.6 ODIN: LEARNING TO OPTIMIZE OPERATION UNIT CONFIGURATION FOR ENERGY-EFFICIENT DNN INFERENCING
Speaker:
Gaurav Narang, Washington State University, US
Authors:
Gaurav Narang, Jana Doppa and Partha Pratim Pande, Washington State University, US
Abstract
ReRAM-based Processing-In-Memory (PIM) architectures enable energy-efficient Deep Neural Network (DNN) inferencing. However, ReRAM crossbars suffer from various non-idealities that affect overall inferencing accuracy. To address that, the matrix-vector-multiplication (MVM) operations are computed by activating a subset of the full crossbar, referred to as Operation Unit (OU). However, OU configurations vary with the neural layers' features such as sparsity, kernel size and their impact on predictive accuracy. In this paper, we consider the problem of learning appropriate layer-wise OU configurations in ReRAM crossbars for unseen DNNs at runtime such that performance is maximized without loss in predictive accuracy. We employ a machine learning (ML) based framework called Odin, which selects the OU sizes for different neural layers as a function of the neural layer features and time-dependent ReRAM conductance drift. Our experimental results demonstrate that the energy-delay-product (EDP) is reduced by up to 8.7× over state-of-the-art homogeneous OU configurations without compromising predictive accuracy.
11:30 CEST TS08.7 SLIPSTREAM: SEMANTIC-BASED TRAINING ACCELERATION FOR RECOMMENDATION MODELS
Speaker:
Yassaman Ebrahimzadeh Maboud, University of British Columbia, CA
Authors:
Yassaman Ebrahimzadeh Maboud1, Muhammad Adnan1, Divya Mahajan2 and Prashant Jayaprakash Nair1
1University of British Columbia, CA; 2Georgia Tech, US
Abstract
Recommendation models play a crucial role in delivering accurate and tailored user experiences. However, training such models poses significant challenges regarding resource utilization and performance. Prior research has proposed an approach that categorizes embeddings into popular and non-popular classes to reduce the training time for recommendation models. We observe that, even among the popular embeddings, certain embeddings undergo rapid training and exhibit minimal subsequent variation, resulting in saturation. Consequently, updates to these embeddings become redundant, lacking any contribution to model quality. This paper presents Slipstream, a software framework that identifies stale embeddings on the fly and skips their updates to enhance performance. Our experiments demonstrate Slipstream's ability to maintain accuracy while effectively discarding updates to non-varying embeddings. This capability enables Slipstream to achieve substantial speedup, optimize CPU-GPU bandwidth usage, and eliminate unnecessary memory access. Slipstream showcases training time reductions of 2x, 2.4x, 1.2x, and 1.175x across real-world datasets and configurations, compared to Baseline XDL, Intel-optimized DRLM, FAE, and Hotline, respectively.
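The sketch below illustrates the general idea of skipping updates to saturated embedding rows: rows whose recent updates fall below a threshold are frozen and excluded from further writes. The threshold, the plain SGD update, and the freezing rule are assumptions for illustration, not Slipstream's actual staleness detection or training pipeline.

```python
# Minimal sketch of skipping updates for "stale" embedding rows: rows whose
# updates become tiny are frozen. Illustration only; the threshold, window and
# SGD update are assumptions, not Slipstream's staleness criterion or pipeline.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 16))          # embedding table: 1000 rows, dim 16
frozen = np.zeros(1000, dtype=bool)        # rows whose updates we skip
lr, tau = 0.05, 1e-3                       # learning rate, staleness threshold

def sparse_update(rows, grads):
    """Apply SGD only to rows that are not frozen; freeze rows that barely move."""
    for r, g in zip(rows, grads):
        if frozen[r]:
            continue                        # skip the redundant update entirely
        delta = lr * g
        emb[r] -= delta
        if np.linalg.norm(delta) < tau:     # update is tiny -> row has saturated
            frozen[r] = True

# Toy usage: a "batch" touching a few embedding rows with near-zero gradients.
rows = np.array([3, 42, 7])
grads = rng.normal(scale=1e-4, size=(3, 16))
sparse_update(rows, grads)
print(frozen[[3, 42, 7]])                   # these rows are now frozen
```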
11:35 CEST TS08.8 COMPASS: A COMPILER FRAMEWORK FOR RESOURCE-CONSTRAINED CROSSBAR-ARRAY BASED IN-MEMORY DEEP LEARNING ACCELERATORS
Speaker:
Jihoon Park, Seoul National University, KR
Authors:
Jihoon Park, Jeongin Choe, Dohyun Kim and Jae-Joon Kim, Seoul National University, KR
Abstract
Recently, crossbar array based in-memory accelerators have been gaining interest due to their high throughput and energy efficiency. While software and compiler support for the in-memory accelerators has also been introduced, they are currently limited to the case where all weights are assumed to be on-chip. This limitation becomes apparent as network sizes grow significantly beyond the in-memory footprint. Weight replacement schemes are essential to address this issue. We propose COMPASS, a compiler framework for resource-constrained crossbar-based processing-in-memory (PIM) deep neural network (DNN) accelerators. COMPASS is specially targeted at networks that exceed the capacity of PIM crossbar arrays, necessitating access to external memories. We propose an algorithm to determine the optimal partitioning that divides the layers so that each partition can be accelerated on chip. Our scheme takes into account the data dependence between layers, core utilization, and the number of write instructions to minimize latency and memory accesses, and improve energy efficiency. Simulation results demonstrate that COMPASS can accommodate many more networks using a minimal memory footprint, while improving throughput by 1.78X and providing 1.28X savings in energy-delay product (EDP) over baseline partitioning methods.
11:40 CEST TS08.9 OPS: OUTLIER-AWARE PRECISION-SLICE FRAMEWORK FOR LLM ACCELERATION
Speaker:
Fangxin Liu, Shanghai Jiao Tong University, CN
Authors:
Fangxin Liu1, Ning Yang1, Zongwu Wang1, Xuanpeng Zhu2, Haidong Yao2, Xiankui Xiong2, Qi Sun3 and Li Jiang1
1Shanghai Jiao Tong University, CN; 2ZTE Corporation, CN; 3Zhejiang University, CN
Abstract
Large language models (LLMs) have transformed numerous AI applications, with on-device deployment becoming increasingly important for reducing cloud computing costs and protecting user privacy. However, the astronomical model size and limited hardware resources pose significant deployment challenges. Model quantization is a promising approach to mitigate this gap, but the presence of outliers in LLMs reduces its effectiveness. Previous efforts addressed this issue by employing compression-based encoding for mixed-precision quantization. These approaches struggle to balance model accuracy with hardware efficiency due to their value-wise outlier granularity and complex encoding/decoding hardware logic. To address this, we propose OPS (Outlier-aware Precision-Slicing), an acceleration framework that exploits massive sparsity in the higher-order part of LLMs by splitting 16-bit values into a 4-bit/12-bit format. Crucially, OPS introduces an early bird mechanism that leverages the high-order 4-bit computation to predict the importance of the full calculation result. This mechanism enables efficient computational skips by continuing execution only for important computations and using preset values for less significant ones. This scheme can be efficiently integrated with existing hardware accelerators like systolic arrays without complex encoding/decoding. As a result, OPS outperforms state-of-the-art outlier-aware accelerators, achieving a 1.3-4.3x performance boost and 14.3-66.7% greater energy efficiency, with minimal model accuracy loss. This approach enables more efficient on-device LLM deployment, effectively balancing computational efficiency and model accuracy.
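As a rough illustration of 4-bit/12-bit precision slicing, the snippet below splits each signed 16-bit weight into a high 4-bit and a low 12-bit slice and uses a cheap high-slice dot product to decide whether the exact result is worth computing. The threshold and fallback value are assumptions; OPS's hardware integration and prediction policy are not modelled.

```python
# Toy sketch of precision slicing: each signed 16-bit weight is split into a
# high 4-bit slice and a low 12-bit slice, and a cheap high-slice dot product
# decides whether the full-precision result is worth computing. The threshold
# and fallback value are assumptions for illustration, not OPS's mechanism.
import numpy as np

def split_4_12(v):
    hi = v >> 12              # signed high 4-bit slice (arithmetic shift)
    lo = v & 0xFFF            # unsigned low 12-bit slice
    assert (hi << 12) + lo == v
    return hi, lo

def sliced_dot(weights, acts, thresh=1 << 15, fallback=0):
    hi = np.array([split_4_12(int(w))[0] for w in weights])
    partial = int(np.dot(hi, acts)) << 12      # high-order estimate of the result
    if abs(partial) < thresh:
        return fallback                        # "early bird": skip the full computation
    return int(np.dot(weights, acts))          # important enough: compute exactly

w = np.array([1000, -30000, 4096, 77], dtype=np.int64)
a = np.array([3, 1, -2, 5], dtype=np.int64)
print(sliced_dot(w, a))                        # exact dot product, computed in full
```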
11:41 CEST TS08.10 OPENC2: AN OPEN-SOURCE END-TO-END HARDWARE COMPILER DEVELOPMENT FRAMEWORK FOR DIGITAL COMPUTE-IN-MEMORY MACRO
Speaker:
Tianchu Dong, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Tianchu Dong, Shaoxuan Li, Yihang Zuo, Hongwu Jiang, Yuzhe Ma and Shanshi Huang, The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Digital Compute-in-Memory (DCIM), which inserts logic circuits into SRAM arrays, presents a significant advancement in CIM architecture. DCIM has shown great potential in applications, and the diversity of applications requires rapid hardware iteration. However, the hardware design flow from user specifications to layout is extremely tedious and time-consuming for manual design. Commercial EDA tools are limited by restrictive licenses and the inability to specifically optimize the datapath, which calls for an open-source end-to-end hardware compiler for DCIM. This paper proposes OpenC2, the first open-source end-to-end development framework for DCIM macro compilation. OpenC2 provides a template-based generation platform for DCIM macros across various technologies, sizes, and configurations. It can automatically generate a datapath-optimized, compact DCIM macro layout based on a hierarchical physical design methodology. Our experiment results show that OpenC2's compact design on FreePDK45 delivers over 30% area reduction and over 40% improvement in area efficiency compared to AutoDCIM on TSMC40.
11:42 CEST TS08.11 SPEEDING-UP SUCCESSIVE READ OPERATIONS OF STT-MRAM VIA READ PATH ALTERNATION FOR DELAY SYMMETRY
Speaker:
Taehwan Kim, Korea University, KR
Authors:
Taehwan Kim and Jongsun Park, Korea University, KR
Abstract
Recent research on data-intensive computing systems has demonstrated that system throughput and latency are critically dependent on memory read bandwidth, highlighting the need for fast memory read operations. Although spin-transfer torque magnetic random-access memory (STT-MRAM) has emerged as a promising alternative to CMOS-based embedded memories, STT-MRAM continues to face challenges related to read speed and energy efficiency. This paper introduces a novel read scheme that enhances read speed and energy efficiency in successive read operations by alternating read paths between data and reference cells. This approach effectively mitigates worst-case read scenarios by balancing the read voltage swings. HSPICE simulations using 28nm CMOS technology show a 31.5% improvement in read speed and a 48.8% reduction in energy consumption compared to the previous approach. SCALE-Sim system simulations also demonstrate that applying the proposed read scheme to STT-MRAM embedded memories in AI accelerators yields a significant reduction in memory energy for CNN inference tasks compared to SRAM embedded memory.

TS09 Low-power, energy-efficient and thermal-aware design

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS09.1 CO-UP: COMPREHENSIVE CORE AND UNCORE POWER MANAGEMENT FOR LATENCY-CRITICAL WORKLOADS
Speaker:
Ki-Dong Kang, Electronics and Telecommunications Research Institute, KR
Authors:
Ki-Dong Kang1, Gyeongseo Park1 and Daehoon Kim2
1Electronics and Telecommunications Research Institute, KR; 2Yonsei University, KR
Abstract
Improving energy efficiency to reduce costs in server environments has attracted considerable attention. Considering that processors account for a significant portion of energy consumption in servers, Dynamic Voltage and Frequency Scaling (DVFS) enhances their energy efficiency by adjusting the operational speed and power consumption of processors. Additionally, modern high-end processors extend DVFS functionality not only to core components but also to uncore parts, because the increasing complexity and integration of Systems on Chips (SoCs) have made the energy consumption of the uncore substantial. However, existing uncore voltage/frequency scaling fails to effectively consider Latency-Critical (LC) applications, leading to sub-optimal energy efficiency or degraded performance. In this paper, we introduce Co-UP, a power management scheme that simultaneously scales core and uncore frequencies for latency-critical applications, designed to improve energy efficiency without violating Service Level Objectives (SLOs). To this end, Co-UP incorporates a prediction model that estimates energy consumption and performance outcomes as uncore and core frequencies change. Based on the estimated gains, Co-UP adjusts uncore and/or core frequencies to further enhance energy efficiency or performance. This predictive model can rapidly adapt to new and unlearned loads, enabling Co-UP to operate online without any prior profiling. Our experiments show that Co-UP can reduce energy consumption by up to 28.2% compared to Intel's existing policy and up to 17.6% compared to state-of-the-art power management studies, without SLO violations.
11:05 CEST TS09.2 FLEXIBLE THERMAL CONDUCTANCE MODEL (TCM) FOR EFFICIENT THERMAL SIMULATION OF 3-D ICS AND PACKAGES
Speaker:
Shunxiang Lan, Shanghai Jiao Tong University, CN
Authors:
Shunxiang Lan, Min Tang and Jun Ma, Shanghai Jiao Tong University, CN
Abstract
Thermal management plays an increasingly important role in the design of 3-D integrated circuits (ICs) and packages. To deal with the related thermal issues, efficient and accurate evaluation of the thermal performance is essential. In this paper, an efficient approach based on a flexible thermal conductance model (TCM) is presented for thermal simulation of 3-D ICs and packages. Firstly, the entire structure is partitioned and classified into two kinds of regions, named region of interest (ROI) and region of fixity (ROF). The ROI usually contains the key components in thermal designs while the ROF holds invariant thermal characteristics. Then, in order to represent the thermal impact of the ROF on the ROI, a novel technique based on the TCM is developed, which can be treated as the equivalent boundary condition of the ROI. By this means, the solution domain of the whole system is constrained to the ROI, which results in a significant reduction of computational costs. Furthermore, in the representation of the ROF, a flexible TCM with elegant rational expressions on the heat convection coefficient is proposed to deal with varying boundary conditions, which greatly expands the applicability of this method. The validity and efficiency of the proposed method are illustrated by numerical examples, where a 138x speedup is achieved compared with commercial software.
11:10 CEST TS09.3 THANOS: ENERGY-EFFICIENT KEYWORD SPOTTING PROCESSOR WITH HYBRID TIME-FEATURE-FREQUENCY-DOMAIN ZERO-SKIPPING
Speaker:
Sangyeon Kim, Sogang University, KR
Authors:
Sangyeon Kim, Hyunmin Kim and Sungju Ryu, Sogang University, KR
Abstract
In recent years, the keyword spotting algorithm has gained significant attention for applications such as personalized virtual assistants. However, the keyword spotting system must always be turned on to listen to the input voice for recognition, which worsens the battery-constraint problem in edge devices. In this paper, we first analyze the sparsities in the keyword spotting computation. Based on these characteristics, we introduce the keyword spotting processor called Thanos, which enables a zero-skipping scheme across multiple keyword spotting domains to mitigate the burdensome energy consumption. Experimental results show that our hybrid-domain zero-skipping scheme reduces the latency and the energy consumption by 80.3-87.4% and 48.1-79.8%, respectively, over the baseline architecture.
11:15 CEST TS09.4 ALGORITHM-HARDWARE CO-DESIGN OF A UNIFIED ACCELERATOR FOR NON-LINEAR FUNCTIONS IN TRANSFORMERS
Speaker:
Haonan Du, Zhejiang University, CN
Authors:
Haonan Du1, Chenyi Wen1, Zhengrui Chen1, Li Zhang2, Qi Sun1, Zheyu Yan1 and Cheng Zhuo1
1Zhejiang University, CN; 2Hubei University of Technology, CN
Abstract
Non-linear functions (NFs) in Transformers require high-precision computation consuming significant time and energy, despite the aggressive quantization schemes for other components. Piece-wise Linear (PWL) approximation-based methods offer more efficient processing schemes for NFs but fall short in dealing with functions with high non-linearities. Moreover, PWL-based methods still suffer from inevitably high latency introduced by the Multiply-And-Add (MADD) unit. To address these issues, this paper proposes a novel quadratic approximation scheme and a highly integrated, multiplier-less hardware structure, as a unified method to accelerate any unary non-linear function. We also demonstrate implementation examples for GELU, Softmax, and LayerNorm. The experimental results show that the proposed method achieves up to 5.41% higher inference accuracy and 60.12% lower area-delay product.
11:20 CEST TS09.5 EFFICIENT HOLD BUFFER OPTIMIZATION BY SUPPLY NOISE-AWARE DYNAMIC TIMING ANALYSIS
Speaker:
Lishuo Deng, Southeast University, CN
Authors:
Lishuo Deng, Changwei Yan, Cai Li, Zhuo Chen and Weiwei Shan, Southeast University, CN
Abstract
As the CMOS process scales down, digital circuits become more susceptible to hold time violations due to increased sensitivity to supply voltage fluctuations. Since hold time violations are fatal, sufficient hold fixing buffers need to be inserted into the short paths to prevent them. However, by assuming a constant power supply level, traditional hold fixing causes imprecise and overly conservative timing analysis and hence leads to circuit overhead and degraded performance. To address this, we propose a power supply noise (PSN)-aware dynamic timing analysis for realistic hold time analysis and efficient hold buffer optimization, which integrates a machine learning-based timing model into the conventional design flow. Building on the highly effective application of the Weibull cumulative distribution function and machine learning for dynamic PSN-aware timing analysis, we propose introducing an additional parameter for PSN amplitude, which has a significant impact on delay, and narrowing the overall parameter range using real PSN waveforms extracted from RedHawk. This approach achieves a prediction error of only 3.45% for cell delay and 5.1% for path delay, while also reducing dataset acquisition costs. To the best of our knowledge, this work is the first to apply PSN-aware dynamic timing analysis specifically for hold optimization, mitigating the pessimism of traditional static timing analysis (STA) and effectively minimizing redundant hold fixing buffers while remaining compatible with existing design workflows. Since short paths often overlap with critical paths, reducing redundant hold buffers not only decreases area overhead but also enhances performance. Applied to a 22 nm, 64-point Fast Fourier Transform (FFT) circuit, our EDA-compatible method combined with a greedy algorithm reduces hold buffers by 55%, achieving not only a 6.79% circuit area reduction but also an 8.1% performance improvement due to the elimination of redundant buffers in short and critical paths.
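A hedged sketch of how a Weibull CDF can be used to model the extra delay caused by a supply-noise droop, with the droop amplitude as an explicit feature. The functional form, parameter names, and constants below are illustrative assumptions, not the paper's fitted model.

```python
import numpy as np

# Hedged sketch: extra cell delay under a supply droop modeled with a Weibull
# CDF over the droop/transition overlap time, scaled by the droop amplitude.

def weibull_cdf(t, scale, shape):
    t = np.maximum(t, 0.0)
    return 1.0 - np.exp(-((t / scale) ** shape))

def cell_delay(nominal_delay_ps, droop_amplitude_mv, droop_time_ps,
               scale=120.0, shape=1.8, sensitivity=0.004):
    """Delay = nominal * (1 + sensitivity * amplitude * Weibull(overlap time)).
    'sensitivity' and the Weibull parameters would be fitted per cell from
    SPICE or RedHawk waveforms in a real flow; values here are placeholders."""
    droop_factor = weibull_cdf(droop_time_ps, scale, shape)
    return nominal_delay_ps * (1.0 + sensitivity * droop_amplitude_mv * droop_factor)

# Example: 50 ps nominal delay, 60 mV droop overlapping the transition for 100 ps.
print(f"{cell_delay(50.0, 60.0, 100.0):.2f} ps")
```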
11:25 CEST TS09.6 LARED: EFFICIENT IR DROP PREDICTOR WITH LAYOUT-PRESERVING REBUILDER-ENCODER-DECODER ARCHITECTURE
Speaker:
Zhou Jin, SSSLab, Dept. of CST, China University of Petroleum-Beijing, China, CN
Authors:
ChengXuan Yu1, YanShuang Teng1, WenHao Dai1, YongJiang Li1, Wei Xing2, Xiao Wu3, Dan Niu4 and Zhou Jin5
1Super Scientific Software Laboratory, University of Petroleum-Beijing, CN; 2The University of Sheffield, GB; 3Huada Empyrean Software Co.Ltd, CN; 4Southeast University, CN; 5Super Scientific Software Laboratory, Dept. of CST, China University of Petroleum-Beijing, CN
Abstract
In the realm of integrated circuit verification, IR drop analysis plays a crucial role. Recent advancements in machine learning (ML) significantly enhance its efficiency, yet many current approaches fail to fully leverage the input structure of feature maps and the transmission mechanism of Power Delivery Network (PDN) layouts. To bridge these gaps, we introduce Layout-Preserving Rebuilder-Encoder-Decoder Architecture Predictor (LaRED), which employs a novel Rebuilder-Encoder-Decoder (RED) architecture and utilizes an innovative downsampling approach and upsampling framework to optimize its perception of instances and the transmission of features. LaRED captures information from various regions with asymmetric topological structure while preserving and transferring layout characteristics through deformable convolution, hybrid downsampling, cascaded upsampling, and attentional feature fusion. The rebuilder rebuilds raw input, whereas the encoder ensures comprehensive feature transmission across all instances. The decoder then facilitates seamless transfer of feature information across layers. This approach enables LaRED to integrate chip features of varying topologies and scales, enhancing its representational power. Compared to the current State-Of-The-Art (SOTA), MAUnet, LaRED achieves accuracy improvements of 34.6% to 42.6% in benchmark tests, establishing it as the new standard in static IR drop analysis for integrated circuit design with ML techniques. The code is available at https://github.com/Todi85/LaRED.
11:30 CEST TS09.7 COOL3D: COST-OPTIMIZED AND EFFICIENT LIQUID COOLING FOR 3D INTEGRATED CIRCUITS
Speaker:
Jing Li, Beihang University, CN
Authors:
Jing Li1, Bingrui Zhang1, Yuquan Sun1, Wei Xing2 and Yuanqing Cheng1
1Beihang University, CN; 2The University of Sheffield, GB
Abstract
CMOS scaling faces challenges due to lithography and device physics issues, leading to increased costs and difficulties in expanding chip footprint. 3D integration technology offers increased integration density without increasing footprint, but elevated power density makes heat dissipation a significant challenge. Microchannel cooling effectively removes heat inside 3D chips. Traditional microchannel optimizations typically focus only on minimizing pump power within a limited parameter design space, leading to suboptimal cooling efficiency. Moreover, existing research rarely considers manufacturing costs, limiting practical application. To address these issues, we propose a high-dimensional non-uniform microchannel design scheme based on Segmented Sampling Bayesian Optimization (SSBO). This multi-parameter collaborative optimization framework comprehensively optimizes microchannel design. Our method reduces pump power by 70% compared to limited parameter design spaces. Additionally, we introduce a cost model for microchannel design, formulating a multi-objective optimization problem that considers both manufacturing cost and pump power consumption. By searching for the Pareto front of this multi-objective problem, we demonstrate a balanced design between microchannel manufacturing cost and pump power and provide guidelines for key design parameters.
11:35 CEST TS09.8 JOINT DNN PARTITION AND THREAD ALLOCATION OPTIMIZATION FOR ENERGY-HARVESTING MEC SYSTEMS
Speaker:
Yizhou Shi, Nanjing University of Science and Technology, CN
Authors:
Yizhou Shi, Liying Li, Yue Zeng, Peijin Cong and Junlong Zhou, Nanjing University of Science and Technology, CN
Abstract
Deep neural networks (DNNs) have demonstrated exceptional performance, leading to diverse applications across various mobile devices (MDs). Considering factors like portability and environmental sustainability, an increasing number of MDs are adopting energy harvesting (EH) techniques for power supply. However, the computational intensity of DNNs presents significant challenges for their deployment on these resource-constrained devices. Existing approaches often employ DNN partition or offloading to mitigate the time and energy consumption associated with running DNNs on MDs. Nonetheless, existing methods frequently fall short in accurately modeling the execution time of DNNs, and do not consider thread allocation for further latency and energy optimization. To solve these problems, we propose a dynamic DNN partition and thread allocation method to optimize the latency and energy consumption of running DNNs on EH-enabled MDs. Specifically, we first investigate the relationship between DNN inference latency and allocated threads and establish an accurate DNN latency prediction model. Based on the prediction model, a DRL-based DNN partition (DDP) algorithm is designed to find the optimal partitions for DNNs. A thread allocation (TA) algorithm is proposed to reduce the inference latency. Experimental results from our test-bed platform demonstrate that, compared to four benchmarking methods, our scheme can reduce DNN inference latency and energy consumption by up to 37.3% and 38.5%, respectively.
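A hedged sketch of the underlying partitioning decision: given a per-layer latency model that depends on thread count, try every cut point between on-device and offloaded layers and keep the fastest. The Amdahl-style thread model and all numbers are illustrative assumptions, not the paper's learned model or DRL policy.

```python
# Illustrative partition-point search for device/server DNN splitting.

def layer_latency_ms(base_ms, threads, parallel_fraction=0.8):
    # Simple Amdahl-style scaling of on-device layer latency with thread count.
    return base_ms * ((1 - parallel_fraction) + parallel_fraction / threads)

def best_partition(layer_ms, layer_out_kb, server_speedup=8.0,
                   uplink_kb_per_ms=50.0, threads=4):
    """Try every cut point k: layers [0,k) run on the MD, the k-th cut's
    activation is uploaded, layers [k,n) run on the server."""
    n = len(layer_ms)
    best = None
    for k in range(n + 1):
        local = sum(layer_latency_ms(t, threads) for t in layer_ms[:k])
        upload = (layer_out_kb[k - 1] / uplink_kb_per_ms) if 0 < k < n else 0.0
        remote = sum(layer_ms[k:]) / server_speedup
        total = local + upload + remote
        if best is None or total < best[1]:
            best = (k, total)
    return best

layer_ms = [4.0, 6.0, 8.0, 8.0, 3.0]          # per-layer single-thread latency (ms)
layer_out_kb = [300, 150, 80, 40, 10]          # activation size at each cut (KB)
print(best_partition(layer_ms, layer_out_kb))  # (cut index, total latency in ms)
```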
11:40 CEST TS09.9 FAST DYNAMIC IR-DROP PREDICTION WITH DUAL-PATH SPATIAL-TEMPORAL ATTENTION
Speaker:
Bangqi Fu, The Chinese University of Hong Kong, HK
Authors:
Bangqi Fu, Lixin Liu, Qijing Wang, Yutao Wang, Martin Wong and Evangeline Young, The Chinese University of Hong Kong, HK
Abstract
The analysis of IR-drop stands as a fundamental step in optimizing the power distribution network (PDN), and subsequently influences the design performance. However, traditional IR-drop analysis using commercial tools proves to be exceedingly time-consuming. Fast and accurate IR-drop analysis is in urgent demand to achieve high performance on timing and power. Recently, machine learning approaches have garnered attention owing to their remarkable speed and extensibility in IC designs. However, prior works for dynamic IR-drop prediction presented limited performance since they did not exploit the time-varying activities. In this paper, we propose a dual-path model with spatial-temporal transformers to extract the static spatial features and dynamic time-variant activities for dynamic IR drop prediction. Experimental results on the large-scale advanced dataset CircuitNet show that our model significantly outperforms the state-of-the-art works.
11:45 CEST TS09.10 A NOVEL FREQUENCY-SPATIAL DOMAIN AWARE NETWORK FOR FAST THERMAL PREDICTION IN 2.5D ICS
Speaker:
Dan Niu, Southeast University, CN
Authors:
Dekang Zhang1, Dan Niu1, Zhou Jin2, Yichao Dong1, Jingweijia Tan3 and Changyin Sun4
1Southeast University, CN; 2Super Scientific Software Laboratory, Dept. of CST, China University of Petroleum-Beijing, CN; 3Jilin University, CN; 4Anhui University, CN
Abstract
In the post-Moore era, 2.5D chiplet-based ICs present significant challenges in thermal management due to increased power density and thermal hotspots. Neural network-based thermal prediction models can perform real-time predictions for many unseen new designs. However, existing CNN-based and GCN-based methods cannot effectively capture the global thermal features, especially for high-frequency components, hindering prediction accuracy enhancement. In this paper, we propose a novel frequency-spatial dual-domain aware prediction network (FSA-Heat) for fast and high-accuracy thermal prediction in 2.5D ICs. It integrates a high-to-low frequency and spatial domain encoder (FSTE) module with a frequency-domain cross-scale interaction module (FCIFormer) to achieve high-to-low frequency and global-to-local thermal dissipation feature extraction. Additionally, a frequency-spatial hybrid loss (FSL) is designed to effectively attenuate high-frequency thermal gradient noises and spatial misalignments. The experimental results show that the performance enhancements offered by our proposed method are substantial, outperforming the newly-proposed 2.5D method, GCN+PNA, by considerable margins (over 99% RMSE reduction, 4.23X inference speedup). Moreover, extensive experiments demonstrate that FSA-Heat also exhibits robust generalization capabilities.
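A minimal sketch of what a frequency-spatial hybrid loss can look like: a spatial MSE combined with an MSE on FFT magnitudes, so high-frequency hotspot structure is penalized explicitly. The weighting and map sizes are illustrative assumptions, not the paper's FSL definition.

```python
import numpy as np

# Illustrative hybrid loss for thermal-map prediction.
def frequency_spatial_loss(pred, target, alpha=1.0, beta=0.1):
    spatial = np.mean((pred - target) ** 2)
    pred_f = np.abs(np.fft.fft2(pred))
    target_f = np.abs(np.fft.fft2(target))
    frequency = np.mean((pred_f - target_f) ** 2)
    return alpha * spatial + beta * frequency

rng = np.random.default_rng(1)
target = rng.random((64, 64))                  # toy 64x64 thermal map
pred = target + 0.05 * rng.standard_normal((64, 64))
print(f"hybrid loss: {frequency_spatial_loss(pred, target):.4f}")
```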

TS10 Applications of Artificial Intelligence Systems

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS10.1 TAIL: EXPLOITING TEMPORAL ASYNCHRONOUS EXECUTION FOR EFFICIENT SPIKING NEURAL NETWORKS WITH INTER-LAYER PARALLELISM
Speaker:
Haomin Li, Shanghai Jiao Tong University, CN
Authors:
Haomin Li1, Fangxin Liu1, Zongwu Wang1, Dongxu Lyu1, Shiyuan Huang1, Ning Yang1, Qi Sun2, Zhuoran Song1 and Li Jiang1
1Shanghai Jiao Tong University, CN; 2Zhejiang University, CN
Abstract
Spiking neural networks (SNNs) are an alternative computational paradigm to artificial neural networks (ANNs) that have attracted attention due to their event-driven execution mechanisms, enabling extremely low energy consumption. However, the existing SNN execution model, based on software simulation or synchronized hardware circuitry, is incompatible with the event-driven nature, thus resulting in poor performance and energy efficiency. The challenge arises from the fact that neuron computations across multiple time steps result in increased latency and energy consumption. To overcome this bottleneck and leverage the full potential of SNNs, we propose TAIL, a pioneering temporal asynchronous execution mechanism for SNNs driven by a comprehensive analysis of SNN computations. Additionally, we propose an efficient dataflow design to support SNN inference, enabling concurrent computation of various time steps across multiple layers for optimal Processing Element (PE) utilization. Our evaluations show that TAIL greatly improves the performance of SNN inference, achieving a 6.94x speedup and a 6.97x increase in energy efficiency on current SNN computing platforms.
11:05 CEST TS10.2 EXPLOITING BOOSTING IN HYPERDIMENSIONAL COMPUTING FOR ENHANCED RELIABILITY IN HEALTHCARE
Speaker:
Sungheon Jeong, University of California, Irvine, US
Authors:
SungHeon Jeong1, Hamza Errahmouni Barkam1, Sanggeon Yun1, Yeseong Kim2, Shaahin Angizi3 and Mohsen Imani1
1University of California, Irvine, US; 2DGIST, KR; 3New Jersey Institute of Technology, US
Abstract
Hyperdimensional computing (HDC) enables efficient data encoding and processing in high-dimensional spaces, benefiting machine learning and data analysis. However, underutilization of these spaces can lead to overfitting and reduced model reliability, especially in data-limited systems—a critical issue in sectors like healthcare that demand robustness and consistent performance. We introduce BoostHD, an approach that applies boosting algorithms to partition the hyperdimensional space into subspaces, creating an ensemble of weak learners. By integrating boosting with HDC, BoostHD enhances performance and reliability beyond existing HDC methods. Our analysis highlights the importance of efficient utilization of hyperdimensional spaces for improved model performance. Experiments on healthcare datasets show that BoostHD outperforms state-of-the-art methods. On the WESAD dataset, it achieved an accuracy of 98.37 ± 0.32%, surpassing Random Forest, XGBoost, and OnlineHD. BoostHD also demonstrated superior inference efficiency and stability, maintaining high accuracy under data imbalance and noise. In person-specific evaluations, it achieved an average accuracy of 96.19%, outperforming other models. By addressing the limitations of both boosting and HDC, BoostHD expands the applicability of HDC in critical domains where reliability and precision are paramount.
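A hedged, simplified sketch of the general boosted-HDC idea, not the BoostHD implementation: split a D-dimensional hyperdimensional encoding into subspaces, train a simple centroid-based HD classifier per subspace, and combine them with accuracy-derived weights. The encoding, weighting rule, and toy data are illustrative assumptions.

```python
import numpy as np

# Illustrative ensemble of weak HD learners over hypervector subspaces.
rng = np.random.default_rng(0)
D, k, n_classes, n_feat = 4096, 8, 2, 16

proj = rng.standard_normal((n_feat, D))            # random-projection encoder
encode = lambda X: np.sign(X @ proj)               # bipolar hypervectors

X = rng.standard_normal((400, n_feat))
y = (X[:, 0] + 0.3 * rng.standard_normal(400) > 0).astype(int)
H = encode(X)

subspaces = np.array_split(np.arange(D), k)
models, weights = [], []
for idx in subspaces:                              # one weak learner per subspace
    centroids = np.stack([H[y == c][:, idx].sum(0) for c in range(n_classes)])
    pred = np.argmax(H[:, idx] @ centroids.T, axis=1)
    acc = max((pred == y).mean(), 1e-6)
    models.append((idx, centroids))
    weights.append(np.log(acc / max(1 - acc, 1e-6)))  # boosting-style weight

def predict(Hq):
    votes = np.zeros((len(Hq), n_classes))
    for (idx, centroids), w in zip(models, weights):
        votes[np.arange(len(Hq)), np.argmax(Hq[:, idx] @ centroids.T, 1)] += w
    return votes.argmax(1)

print("train accuracy:", (predict(H) == y).mean())
```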
11:10 CEST TS10.3 LCACHE: LOG-STRUCTURED SSD CACHING FOR TRAINING DEEP LEARNING MODELS
Speaker:
Shucheng Wang, China Mobile (Suzhou) Software Technology, CN
Authors:
Shucheng Wang1, Zhiguo Xu1, Zhandong Guo1, Jian Sheng2, Kaiye Zhou1 and Qiang Cao3
1China Mobile (Suzhou) Software Technology Co., Ltd., CN; 2Suzhou City University, CN; 3Huazhong University of Science and Technology, CN
Abstract
Training deep learning models is computationally and data-intensive. Existing approaches utilize local SSDs within training servers to cache datasets, thereby accelerating data loading during model training. However, we experimentally observe that data loading remains a performance bottleneck when randomly retrieving small-sized sample files on SSDs. In this paper, we introduce LCache, a log-structured dataset caching mechanism designed to fully leverage the I/O capabilities of SSDs and reduce I/O-induced training stalls. LCache determines the randomized dataset access order by extracting the pseudo-random seed from the training frameworks. It then aggregates small-sized sample files into larger chunks and stores them in a log file on SSDs, thus enabling sequential I/O requests on data retrieval and improving data loading throughput. Furthermore, LCache proposes a real-time log reordering mechanism that strategically schedules cached data to organize logs across different epochs, which enhances cache utilization and minimizes data retrieval from low-performance remote storage systems. Additionally, LCache incorporates a MetaIndex to enable rapid log traversal and querying. We evaluate LCache with various real-world DL models and datasets. LCache outperforms the native PyTorch Dataloader and NoPFS by up to 9.4x and 7.8x in throughput, respectively.
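A minimal sketch of the log-structuring idea, under the assumption that the framework's shuffle order can be reproduced from its pseudo-random seed: precompute the epoch's access order and lay samples out sequentially in a log so random small reads become sequential I/O. The file layout and helper names are illustrative, not LCache's format.

```python
import random, struct, io

# Illustrative log builder keyed to a reproducible shuffle order.
def epoch_order(num_samples, seed):
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)     # mirrors the framework's shuffle
    return order

def build_log(samples, order):
    """Append samples to an in-memory 'log' in the exact order they will be
    requested; record (offset, length) so reads become sequential."""
    log, index, offset = io.BytesIO(), {}, 0
    for i in order:
        data = samples[i]
        log.write(struct.pack("<I", len(data)) + data)
        index[i] = (offset, len(data))
        offset += 4 + len(data)
    return log.getvalue(), index

samples = [bytes([i]) * (100 + i) for i in range(8)]   # toy sample files
order = epoch_order(len(samples), seed=42)
log, index = build_log(samples, order)
print(order, {i: index[i] for i in order[:3]})
```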
11:15 CEST TS10.4 OLORAS: ONLINE LONG RANGE ACTION SEGMENTATION FOR EDGE DEVICES
Speaker:
Filippo Ziche, Università di Verona, IT
Authors:
Filippo Ziche and Nicola Bombieri, Università di Verona, IT
Abstract
Temporal action segmentation (TAS) is essential for identifying when actions are performed by a subject, with applications ranging from healthcare to Industry 5.0. In such contexts, the need for real-time, low-latency responses and privacy-aware data handling often requires the use of edge devices, despite their limited memory, power, and computational resources. This paper presents OLORAS, a novel TAS model designed for real-time performance on edge devices. By leveraging human pose data instead of video frames and employing linear recurrent units (LRUs), OLORAS efficiently processes long sequences while minimizing memory usage. Tested on the standard Assembly101 dataset, the model outperforms state-of-the-art TAS methods in accuracy with 10x memory footprint reduction, making it well-suited for deployment on resource-constrained devices.
11:20 CEST TS10.5 ONLINE LEARNING FOR DYNAMIC STRUCTURAL CHARACTERIZATION IN ELECTRON ENERGY LOSS SPECTROSCOPY
Speaker:
Lakshmi Varshika Mirtinti, Drexel University, US
Authors:
Lakshmi Varshika M1, Jonathan Hollenbach2, Nicolas Agostini3, Ankur Limaye3, Antonino Tumeo4 and Anup Das1
1Drexel University, US; 2Johns Hopkins University, US; 3Pacific Northwest National Lab, US; 4Pacific Northwest National Laboratory, US
Abstract
In-situ Electron Energy Loss Spectroscopy (EELS) is a crucial technique for determining the elemental composition of materials through EELS Spectrum Images (EELS-SI). While recent innovations have made it possible for EELS-SI data acquisition at rates of 400 frames per second with near-zero read noise, the challenge lies in processing this massive stream of real-time data to capture nanoscale dynamic changes. This task demands advanced machine learning methods capable of identifying subtle and complex features in EELS spectra. Furthermore, the EELS data acquired in difficult experimental conditions often suffer from a low signal-to-noise ratio (SNR), leading to unreliable classification and limiting their utility. In response to this critical need, we introduce a spiking neural network (SNN)-based Variational Autoencoder (VAE) that embeds spectral data into a latent space, facilitating precise prediction of structural changes. VAEs are designed to learn efficient low-dimensional representations while capturing the inherent variability in the data, making them highly effective for processing multi-dimensional data. Additionally, SNNs, which use biologically inspired spiking neurons, offer unmatched scalability and energy efficiency by processing information through binary spikes, making them ideal for high-throughput data. We validate our framework using MXene annealing data, achieving denoised spectrum images with an SNR of 28.3dB. For the first time, we present a fully online learning solution for dynamic structural tracking, implemented directly in hardware, eliminating the traditional bottleneck of offline training. Our method achieves reliable, real-time, on-device characterization of high-speed EELS data when evaluated on an FPGA platform. Joint experiments with the SNN-VAE model on both spiking autoencoder hardware and a software-trained hybrid configuration of hardware spiking encoders demonstrated latency reductions of 25.2× and 93.7×, and energy savings of 1.04× and 4.5×, respectively, compared to the baseline.
11:25 CEST TS10.6 SCALES: BOOST BINARY NEURAL NETWORK FOR IMAGE SUPER-RESOLUTION WITH EFFICIENT SCALINGS
Speaker:
Renjie Wei, Peking University, CN
Authors:
Renjie Wei1, Zechun Liu2, Yuchen Fan3, Runsheng Wang1, Ru Huang1 and Meng Li1
1Peking University, CN; 2Meta Inc, US; 3Meta, US
Abstract
Deep neural networks for image super-resolution (SR) have demonstrated superior performance. However, their large memory and computation consumption hinders deployment on resource-constrained devices. Binary neural networks (BNNs), which quantize the floating-point weights and activations to 1 bit, can significantly reduce the cost. Although BNNs for image classification have made great progress, existing BNNs for SR still suffer from a large performance gap relative to full-precision (FP) SR networks. To this end, we observe the activation distribution in SR networks and find much larger pixel-to-pixel, channel-to-channel, layer-to-layer, and image-to-image variation in the activation distribution than in image classification networks. However, existing BNNs for SR fail to capture these variations that contain rich information for image reconstruction, leading to inferior performance. To address this problem, we propose SCALES, a binarization method for SR networks that consists of the layer-wise scaling factor, the spatial re-scaling method, and the channel-wise re-scaling method, capturing the layer-wise, pixel-wise, and channel-wise variations efficiently in an input-dependent manner. We evaluate our method across different network architectures and datasets. For CNN-based SR networks, our binarization method SCALES outperforms the prior art method by 0.2dB with fewer parameters and operations. With SCALES, we achieve the first accurate binary Transformer-based SR network, improving PSNR by more than 1dB compared to the baseline method.
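A hedged sketch of input-dependent re-scaling for binarized activations, illustrative of the idea rather than SCALES' exact formulation: binarize to {-1, +1}, then restore per-layer, per-channel, and per-pixel magnitude with cheap scaling factors computed from the full-precision activation.

```python
import numpy as np

# Illustrative binarization with layer-, channel-, and pixel-wise scales.
def binarize_with_scales(x):
    """x: activations with shape (C, H, W)."""
    sign = np.where(x >= 0, 1.0, -1.0)
    layer_scale = np.mean(np.abs(x))                                  # one scalar per layer
    channel_scale = np.mean(np.abs(x), axis=(1, 2), keepdims=True)    # per channel
    spatial_scale = np.mean(np.abs(x), axis=0, keepdims=True)         # per pixel
    # Combine the factors multiplicatively, normalized by the layer scale.
    scale = channel_scale * spatial_scale / max(layer_scale, 1e-8)
    return sign * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
xb = binarize_with_scales(x)
print("reconstruction MSE:", np.mean((x - xb) ** 2))
```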
11:30 CEST TS10.7 POROS: ONE-LEVEL ARCHITECTURE-MAPPING CO-EXPLORATION FOR TENSOR ALGORITHMS
Speaker:
Fuyu Wang, Sun Yat-sen University, CN
Authors:
Fuyu Wang and Minghua Shen, Sun Yat-sen University, CN
Abstract
Tensor algorithms increasingly rely on specialized accelerators to meet growing performance and efficiency demands. Given the rapid evolution of these algorithms and the high cost of designing accelerators, automated solutions for jointly optimizing both architectures and mappings have gained attention. However, the joint design space is non-convex and non-smooth, hindering the finding of optimal or near-optimal designs. Moreover, prior work conducts two-level exploration, resulting in a combinatorial explosion. In this paper, we propose Poros, a one-level architecture-mapping co-exploration framework. Poros directly explores a batch of architecture-mapping configurations and evaluates their performance. It then exploits reinforcement learning to perform gradient-based search in the non-smooth joint design space. By sampling from the policy, Poros keeps exploring new actions to address non-convexity. Experimental results demonstrate that Poros achieves up to 5.32x and 2.15x better EDP compared with hand-designed accelerators and state-of-the-art automatic approaches, respectively. Through its one-level exploration scheme, Poros also converges at least 20% faster than other approaches.
11:35 CEST TS10.8 A CNN COMPRESSION METHODOLOGY FOR LAYER-WISE RANK SELECTION CONSIDERING INTER-LAYER INTERACTIONS
Speaker:
Milad Kokhazadeh, School of Informatics, Aristotle University of Thessaloniki, GR
Authors:
Milad Kokhazadeh1, Georgios Keramidas2, Vasilios Kelefouras3 and Iakovos Stamoulis4
1PhD Candidate, Aristotle University of Thessaloniki, GR; 2Aristotle University of Thessaloniki / Think Silicon S.A., GR; 3University of Plymouth, GB; 4Think Silicon, S.A. An Applied Materials Company, GR
Abstract
Convolutional Neural Networks (CNNs) achieve state-of-the-art performance across various application domains but are often resource-intensive, limiting their use on resource-constrained devices. Low-rank factorization (LRF) has emerged as a promising technique to reduce the computational complexity and memory footprint of CNNs, enabling efficient deployment without significant performance loss. However, challenges remain in optimizing the rank selection problem, balancing memory reduction and accuracy, and integrating LRF into the training process of CNNs. In this paper, a novel and generic methodology for layer-wise rank selection is presented, considering inter-layer interactions. Our approach is compatible with any decomposition method and does not require additional retraining. The proposed methodology is evaluated on thirteen widely used CNN models, significantly reducing model parameters and Floating-Point Operations (FLOPs). In particular, our approach achieves up to a 94.6% parameter reduction (82.3% on average) and up to a 90.7% FLOPs reduction (59.6% on average), with less than a 1.5% drop in validation accuracy, demonstrating superior performance and scalability compared to existing techniques.
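A minimal sketch of low-rank factorization with a simple per-layer rank choice: decompose a weight matrix with SVD and keep the smallest rank whose singular-value energy exceeds a threshold. The energy criterion is a common heuristic used here for illustration, not the paper's inter-layer-aware selection method.

```python
import numpy as np

# Illustrative SVD-based layer factorization with an energy-based rank cutoff.
def factorize(W, energy=0.95):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    r = int(np.searchsorted(np.cumsum(S**2) / np.sum(S**2), energy)) + 1
    A = U[:, :r] * S[:r]          # (out, r)
    B = Vt[:r, :]                 # (r, in)
    return A, B, r

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 100)) @ rng.standard_normal((100, 512))  # low-rank-ish layer
A, B, r = factorize(W)
orig, fact = W.size, A.size + B.size
print(f"rank {r}, params {orig} -> {fact} ({100 * (1 - fact / orig):.1f}% reduction), "
      f"relative error {np.linalg.norm(W - A @ B) / np.linalg.norm(W):.3f}")
```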
11:40 CEST TS10.9 FINEQ: SOFTWARE-HARDWARE CO-DESIGN FOR LOW-BIT FINE-GRAINED MIXED-PRECISION QUANTIZATION OF LLMS
Speaker:
Xilong Xie, Beihang University, CN
Authors:
Xilong Xie1, Liang Wang1, Limin Xiao1, Meng Han1, Lin Sun2, Shuai Zheng1 and Xiangrong Xu1
1Beihang University, CN; 2Jiangsu Shuguang Optoelectric Co., Ltd., CN
Abstract
Large language models (LLMs) have significantly advanced the natural language processing paradigm but impose substantial demands on memory and computational resources. Quantization is one of the most effective ways to reduce the memory consumption of LLMs. However, advanced single-precision quantization methods experience significant accuracy degradation when quantizing to ultra-low bits. Existing mixed-precision quantization methods quantize by groups at coarse granularity. Employing high precision for group data leads to substantial memory overhead, whereas low precision severely impacts model accuracy. To address this issue, we propose FineQ, a software-hardware co-design for low-bit fine-grained mixed-precision quantization of LLMs. First, FineQ partitions the weights into finer-grained clusters and considers the distribution of outliers within these clusters, thus achieving a balance between model accuracy and memory overhead. Then, we propose an outlier protection mechanism within clusters that uses 3 bits to represent outliers and introduce an encoding scheme for index and data concatenation to enable aligned memory access. Finally, we introduce an accelerator utilizing temporal coding that effectively supports the quantization algorithm while simplifying the multipliers in the systolic array. FineQ achieves higher model accuracy compared to the SOTA mixed-precision quantization algorithm at a close average bit-width. Meanwhile, the accelerator achieves up to 1.79x higher energy efficiency and reduces the area of the systolic array by 61.2%.
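A hedged sketch of fine-grained mixed-precision quantization with in-cluster outlier protection, illustrative of the general idea rather than FineQ's exact encoding: split weights into small clusters, quantize inliers to 2 bits with a per-cluster scale, and keep the largest-magnitude outlier of each cluster at higher precision. Cluster size and bit widths are assumptions.

```python
import numpy as np

# Illustrative per-cluster quantization with one protected outlier per cluster.
def quantize_cluster(w, inlier_bits=2, outliers_per_cluster=1):
    out_idx = np.argsort(np.abs(w))[-outliers_per_cluster:]   # protect outliers
    q = np.zeros_like(w)
    mask = np.ones(len(w), dtype=bool)
    mask[out_idx] = False
    scale = np.max(np.abs(w[mask])) / (2 ** (inlier_bits - 1) - 1 + 1e-12)
    q[mask] = np.round(w[mask] / scale) * scale                # low-bit inliers
    q[out_idx] = w[out_idx]                                    # kept at higher precision here
    return q

rng = np.random.default_rng(0)
W = rng.standard_normal(1024) * np.where(rng.random(1024) < 0.01, 8.0, 1.0)
clusters = W.reshape(-1, 8)                                    # 8-weight clusters
Wq = np.concatenate([quantize_cluster(c) for c in clusters])
print("quantization MSE:", np.mean((W - Wq) ** 2))
```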
11:45 CEST TS10.10 SOLVING THE COLD-START PROBLEM FOR THE EDGE: CLUSTERING AND ADAPTIVE DEEP LEARNING FOR EMOTION DETECTION
Speaker:
Junjiao Sun, Centro de Electrónica Industrial, Universidad Politécnica de Madrid, ES
Authors:
Junjiao Sun1, Laura Gutierrez Martin2, Jose Miranda Calero3, Celia López-Ongil2, Jorge Portilla1 and Jose Andres Otero Marnotes1
1Centro de Electrónica Industrial Universidad Politecnica de Madrid, ES; 2UC3M (Universidad Carlos III de Madrid), ES; 3EPFL, CH
Abstract
Designing AI-based applications personalized to each user's behavior presents significant challenges due to the cold start problem and the impracticality of extensive individual data labeling. These challenges are further compounded when deploying such applications at the edge, where limited computing resources constrain the design space. This paper introduces a novel approach to AI-driven personalized solutions in biosensing applications by combining deep learning with clustering-based separation techniques. The proposed Clustering and Learning for Emotion Adaptive Recognition (CLEAR) methodology strikes a balance between population-wide models and fully personalized systems by leveraging data-driven clustering. CLEAR demonstrates its effectiveness in emotion recognition tasks, and its integration with fine-tuning enables efficient deployment on edge devices, ensuring data privacy and real-time detection when new users are introduced to the system. We conducted experiments for model personalization on two edge computing platforms: the Coral Edge TPU Dev Board and the Raspberry Pi with an Intel Movidius Neural Compute Stick 2. The results show that initial cluster assignment for new users can be achieved without labeled data, directly addressing the cold-start problem. Compared to baseline validation without clustering, this proposal improves accuracy from 75% to 81.9%. Furthermore, fine-tuning with minimal labeled data significantly improves accuracy, achieving up to 86.34% for the fear detection task in the WEMAC dataset while remaining suitable for deployment on resource-constrained edge devices.
11:50 CEST TS10.11 KALMMIND: A CONFIGURABLE KALMAN FILTER DESIGN FRAMEWORK FOR EMBEDDED BRAIN-COMPUTER INTERFACES
Speaker:
Guy Eichler, Columbia University, Department of Computer Science, IL
Authors:
Guy Eichler, Joseph Zuckerman and Luca Carloni, Columbia University, US
Abstract
Kalman Filter (KF) is one of the most prominent algorithms to predict motion from measurements of brain activity. However, little effort has been made to optimize the KF for deployment in embedded brain-computer interfaces (BCIs). To address this challenge, we propose a new framework for designing KF hardware accelerators specialized for BCI, which facilitates design-space exploration by providing a tunable balance between latency and accuracy. Through FPGA-based experiments with brain data, we demonstrate improvements in both latency and accuracy compared to the state of the art.
11:51 CEST TS10.12 SEGTRANSFORMER: ENHANCING SOFTMAX PERFORMANCE THROUGH SEGMENTATION WITH A RERAM-BASED PIM ACCELERATOR
Speaker:
Ing-Chao Lin, National Cheng Kung University, TW
Authors:
YuCheng Wang1, Ing-Chao Lin1 and Yuan-Hao Chang2
1National Cheng Kung University, TW; 2Academia Sinica, TW | National Taiwan University, TW
Abstract
To accelerate Transformer computations, numerous ReRAM-based Processor-In-Memory (PIM) architectures have been proposed, which effectively speed up matrix multiplication. However, these approaches often shift the performance bottleneck from the attention mechanism to the Softmax computation. Additionally, data sharding for acceleration can disrupt the core logic of the Transformer, and when computing the exponential part of extremely small Euler's numbers, slight output differences lead to inefficiency in Softmax computation. To address these challenges, we propose SegTransformer, a ReRAM-based PIM accelerator that enhances matrix computation speed through segmentation techniques and generates segmented data for local Softmax operations. Moreover, we introduce an Integrated Softmax Processing Unit (ISPU), which computes both local Softmax and global factors to reduce errors and improve efficiency. Experimental results show that SegTransformer outperforms state-of-the-art Transformer accelerators.

LK02 ASD Lunchtime Keynote

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 13:15 CEST - 14:00 CEST

Time Label Presentation Title
Authors
13:15 CEST LK02.1 AI/ML AT THE FOREFRONT OF SEMICONDUCTOR EVOLUTION: ENHANCING DESIGN, EFFICIENCY, AND PERFORMANCE
Presenter:
Yankin Tanurhan, Synopsys, US
Author:
Yankin Tanurhan, Synopsys, US
Abstract
As artificial intelligence (AI) and machine learning (ML) drive innovation, their impact on the semiconductor market is transformative. This keynote will explore the latest AI/ML trends and their implications for SoC designs targeting high-performance compute, edge AI, and IoT applications. The presentation will cover AI/ML's role in developing next-generation semiconductor designs, including how AI/ML algorithms are incorporated into EDA tools to optimize chip design and enable efficient verification and manufacturing. Emerging AI/ML trends driving requirements for advanced neural processing units (NPU) will be explored, including generative AI applications like large language models and text-to-image generators. Finally, the role of transformer-based neural networks in implementing energy-efficient SoCs will be discussed.

ET03 Lifecycle Management of Emerging Memories: Why and How?

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 15:30 CEST

Abstract:

Emerging memory technologies, such as Resistive RAM (ReRAM), Phase-Change Memory (PCM), Spin-Transfer Torque Magnetic Memory (STT-MRAM), and Ferroelectric FET (FeFET), receive a lot of interest from both academia and industry thanks to their attractive properties. These technologies can implement dense, fast, and non-volatile memories that can be used to efficiently store data as well as implement AI circuits. However, mass production is still limited, because these technologies suffer from quality and reliability issues that need to be addressed after manufacturing and during lifetime. These technologies are susceptible to new manufacturing defects due to new materials and structures as well as endurance problems. This tutorial presents a holistic view on the root causes of quality and reliability issues, their impact on the circuit’s behavior, and possible solutions to properly address these issues, guaranteeing the required quality and reliability level. Finally, this tutorial allows attendees to understand the lifecycle management choices available to ensure high-quality and reliable emerging memories.

Speakers:

Leticia Maria Bolzani Poehls, IHP – Leibniz Institute for High Performance Microelectronics - Germany

Moritz Fieback, Delft University of Technology, The Netherlands

Target audience:

This tutorial is addressed to academia (from PhD students to postdocs) and professionals from industry who would like to know more about how to guarantee the quality of emerging memories and consequently their adoption in real applications. Around 40 participants are expected.

Learning objectives:

  • Describe why emerging memories need lifecycle management and how this holistic approach fits in the memories’ design process
  • Present and compare the lifecycle management of two different types of emerging memories including their quality and reliability issues and possible solutions
  • Summarize the key challenges that are involved in future lifecycle management for emerging memories

Required background:

  • Basic understanding of emerging memories and some general understanding of the definitions related to test theory and reliability.

Detailed program:

The proposed tutorial is based on the following plan:

  • Introduction: Why we need emerging memories?
  • Background: Why we need to adopt a lifecycle management approach for emerging memories?
  • Case study 1: Memory type, RRAMs
  • Case study 2: Memory type, STT-MRAMs
  • Comparison highlighting overlapping and differentiating features of two technologies
  • Conclusion & Future

FS10 Focus Session - GenAI-Native EDA: Redefining Verification with Large Language Models

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 15:30 CEST

Session chair:
Pierre-Emmanuel Gaillardon, University of Utah, US

Organiser:
Pierre-Emmanuel Gaillardon, University of Utah, US

As hardware design processes grow in complexity and scale, verification methodologies reliant on human expertise and manual effort are increasingly insufficient to handle intricate interdependencies and challenging constraints across design stages. Generative AI (GenAI), particularly Large Language Models (LLMs), offers a breakthrough approach, enabling sophisticated pattern recognition, multimodal data integration, and adaptive learning to tackle these verification challenges. From early-stage Power, Performance, and Area (PPA) estimations to advanced anomaly detection and layout optimization, AI-driven tools are set to transform verification workflows. By synthesizing circuit representations across specifications, netlists, and physical layouts, these AI models promise not only enhanced verification precision but also significant reductions in time-to-market, making verification processes more scalable for next-generation design technologies. In this special session, attendees will explore cutting-edge GenAI applications for EDA with a focus on hardware verification, gaining insights into how advanced techniques like multimodal learning, LLM-based optimization, and multi-agent systems can boost verification accuracy. Discussions will highlight foundational shifts toward AI-native EDA tools and examine the potential of LLMs to automate and scale verification to meet the demands of increasingly complex hardware systems. Presentations will also cover AI-driven approaches to optimizing verification workflows and automating the detection of potential design flaws.

Time Label Presentation Title
Authors
14:00 CEST FS10.1 EDA-AWARE RTL GENERATION WITH LARGE LANGUAGE MODELS
Speaker:
Valerio Tenace, PrimisAI, US
Authors:
Mubashir Islam1, Humza Sami1, Pierre-Emmanuel Gaillardon2 and Valerio Tenace1
1PrimisAI, US; 2University of Utah -- PrimisAI, US
Abstract
Large Language Models (LLMs) have become increasingly popular for generating RTL code. However, producing error-free RTL code in a zero-shot setting remains highly challenging even for state-of-the-art LLMs, often leading to issues that require manual, iterative refinement. This additional debugging process can dramatically increase the verification workload, underscoring the need for robust, automated correction mechanisms to ensure code correctness from the start. We will present AIVRIL2, a self-verifying, LLM-agnostic agentic framework aimed at enhancing RTL code generation through iterative corrections of both syntax and functional errors. Our approach leverages a collaborative multi-agent system that incorporates feedback from error logs generated by EDA tools to automatically identify and resolve design flaws. Experimental results, conducted on the VerilogEval-Human benchmark suite, demonstrate that our framework significantly improves code quality, achieving nearly a 3.4× enhancement over prior methods. In the best-case scenario, functional pass rates of 77% for Verilog and 66% for VHDL were obtained, thus substantially improving the reliability of LLM-driven RTL code generation.

HSD02 HackTheSilicon DATE

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 18:00 CEST


LKS03 Later … with the keynote speakers

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 15:00 CEST


TS11 Architectural and microarchitectural design - 1

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS11.1 ACCELERATING AUTHENTICATED BLOCK CIPHERS VIA RISC-V CUSTOM CRYPTOGRAPHY INSTRUCTIONS
Speaker:
Yuhang Qiu, State Key Lab of Processors, Institute of Computing Technology, CAS, Beijing, China, CN
Authors:
Qiu Yuhang, Wenming Li, Liu Tianyu, Wang Zhen, Zhang Zhiyuan, Fan Zhihua, Ye Xiaochun, Fan Dongrui and Tang Zhimin, State Key Lab of Processors, Institute of Computing Technology, CAS, CN
Abstract
As one of the standardized encryption algorithms, authenticated block ciphers based on Galois/Counter Mode (GCM) are a widely used method to guarantee accuracy and reliability in data transmission. Across the execution of authenticated block ciphers, the authentication operation is the main performance bottleneck because it introduces operations in a high-dimensional Galois field (GF) that cannot be executed efficiently with existing ISAs. To overcome this problem, we propose a custom ISA extension and combine it with the RISC-V cryptography extension to accelerate the whole process of authenticated block ciphers. Besides, we propose a specific hardware design including a fully-pipelined GF(2^128) multiplier to support the extended instructions and integrate it into the multi-issue out-of-order core XT910 without introducing any clock frequency overhead. The proposed design manages to accelerate the main operations in various kinds of authenticated block ciphers. We compare the performance of our design to other existing acceleration schemes based on RISC-V ISA extensions. Experimental results show that our design outperforms other related work and achieves up to 17x speedup with a lightweight hardware overhead.
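For reference, the operation that such a multiplier accelerates is GCM's GHASH multiplication in GF(2^128). Below is the textbook bit-serial shift-and-reduce formulation (the hardware pipelines this into a single-cycle-throughput unit); operands are arbitrary example values.

```python
# Reference bit-serial multiplication in GF(2^128) as used by GCM's GHASH.
R = 0xE1 << 120   # reduction constant for x^128 + x^7 + x^2 + x + 1

def gf128_mul(x: int, y: int) -> int:
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:      # process x MSB-first
            z ^= v
        if v & 1:                      # shift v with modular reduction
            v = (v >> 1) ^ R
        else:
            v >>= 1
    return z

# Two arbitrary 128-bit operands.
a = 0x0388DACE60B6A392F328C2B971B2FE78
b = 0x66E94BD4EF8A2C3B884CFA59CA342B2E
print(hex(gf128_mul(a, b)))
```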
14:05 CEST TS11.2 NDPAGE: EFFICIENT ADDRESS TRANSLATION FOR NEAR-DATA PROCESSING ARCHITECTURES VIA TAILORED PAGE TABLE
Speaker:
Qingcai Jiang, University of Science and Technology of China, CN
Authors:
Qingcai Jiang, Buxin Tu and Hong An, University of Science and Technology of China, CN
Abstract
Near-Data Processing (NDP) has been a promising architectural paradigm to address the memory wall problem for data-intensive applications. Practical implementation of NDP architectures calls for system support for better programmability, where having virtual memory (VM) is critical. Modern computing systems incorporate a 4-level page table design to support address translation in VM. However, simply adopting an existing 4-level page table design in NDP systems causes significant address translation overhead because (1) NDP applications generate a lot of address translation requests, and (2) the limited L1 cache in NDP systems cannot cover the accesses to page table entries (PTEs). We extensively analyze the 4-level page table design and observe that (1) the memory access to page table entries is highly irregular, thus cannot benefit from the L1 cache, and (2) the last two levels of page tables are nearly fully occupied. Based on our observations, we propose NDPage, an efficient page table design tailored for NDP systems. The key mechanisms of NDPage are (1) an L1 cache bypass mechanism for PTEs that not only accelerates the memory accesses of PTEs but also prevents the pollution of PTEs in the cache system, and (2) a flattened page table design that merges the last two levels of page tables, allowing the page table to enjoy the flexibility of a 4KB page while reducing the number of PTE accesses. We evaluate NDPage using a variety of data-intensive workloads. Our evaluation shows that in a single-core NDP system, NDPage improves the end-to-end performance over the state-of-the-art address translation mechanism by 14.3%; in 4-core and 8-core NDP systems, NDPage enhances the performance by 9.8% and 30.5%, respectively.
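A hedged sketch contrasting a conventional 4-level radix page walk with a flattened walk that merges the last two levels into one larger table, which is the conceptual effect of NDPage's design. Field widths follow x86-64-style 9-bit indices and 4 KB pages; the merged level uses an 18-bit index. All details are illustrative assumptions.

```python
# Illustrative walk-length comparison for 4-level vs. flattened page tables.
def split_vaddr(vaddr, index_bits):
    """Split the virtual page number into per-level indices, top level first."""
    vpn = vaddr >> 12
    idxs, shift = [], sum(index_bits)
    for bits in index_bits:
        shift -= bits
        idxs.append((vpn >> shift) & ((1 << bits) - 1))
    return idxs

def walk_accesses(vaddr, index_bits):
    # One memory access per page-table level touched during the walk.
    return len(split_vaddr(vaddr, index_bits))

vaddr = 0x00007F12_3456_7000
print("4-level walk accesses:   ", walk_accesses(vaddr, [9, 9, 9, 9]))      # 4
print("flattened walk accesses: ", walk_accesses(vaddr, [9, 9, 18]))        # 3
```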
14:10 CEST TS11.3 SPIRE: INFERRING HARDWARE BOTTLENECKS FROM PERFORMANCE COUNTER DATA
Speaker:
Nicholas Wendt, University of Michigan, US
Authors:
Nicholas Wendt1, Mahesh Ketkar2 and Valeria Bertacco1
1University of Michigan, US; 2Intel Labs, US
Abstract
The persistent demand for greater computing efficiency, coupled with diminishing returns from semiconductor scaling, has led to increased microarchitecture complexity and diversity. Thus, it has become increasingly difficult for application developers and hardware architects to accurately identify low-level performance bottlenecks. Abstract performance models, such as roofline models, help but strip away important microarchitectural details. In contrast, analyses based on hardware performance counters preserve detail but are challenging to implement. This work proposes SPIRE, a novel performance model that combines the accessibility and generality of roofline models with the microarchitectural detail of performance counters. SPIRE (Statistical Piecewise Linear Roofline Ensemble) uses a collection of roofline models to estimate a processor's maximum throughput, based on data from its performance counters. Training this ensemble simply requires sampling data from a processor's performance counters. After training a SPIRE model on 23 workloads running on a CPU, we evaluated it with 4 new workloads and compared our findings against a commercial performance analysis tool. We found that our SPIRE analysis accurately identified many of the same bottlenecks while requiring minimal deployment effort.
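For context, a minimal sketch of the classic single-roofline bound that SPIRE generalizes into a counter-driven ensemble: attainable throughput is the minimum of the compute ceiling and bandwidth times operational intensity, with intensity derived from counter-style event counts. The peak numbers are illustrative assumptions.

```python
# Illustrative single-roofline bound from performance-counter-style inputs.
def roofline_gflops(flops, bytes_moved, peak_gflops=500.0, peak_gbps=100.0):
    intensity = flops / max(bytes_moved, 1)          # FLOPs per byte
    return min(peak_gflops, peak_gbps * intensity), intensity

# Counter-style inputs: retired FP ops and DRAM bytes over a sampling window.
bound, oi = roofline_gflops(flops=2.0e11, bytes_moved=8.0e11)
print(f"operational intensity {oi:.2f} FLOP/B, roofline bound {bound:.1f} GFLOP/s")
```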
14:15 CEST TS11.4 IMPROVING ADDRESS TRANSLATION IN TAGLESS DRAM CACHE BY CACHING PTE PAGES
Speaker:
Osang Kwon, Sungkyunkwan University, KR
Authors:
Osang Kwon, Yongho Lee and Seokin Hong, Sungkyunkwan University, KR
Abstract
This paper proposes a novel caching mechanism for PTE pages to enhance the Tagless DRAM Cache architecture and improve address translation in large in-package DRAM caches. Existing OS-managed DRAM cache architectures have achieved significant performance improvements by focusing on efficient tag management. However, prior studies have been limited in that they only update the PTE after caching pages, without directly accessing PTEs from the DRAM cache. This limitation leads to performance degradation during page walks. To address this issue, we propose a method to copy both data pages and PTE pages simultaneously to the DRAM cache. This approach reduces address translation and cache access latency. Additionally, we introduce a shootdown mechanism to maintain the consistency of PTEs and page walk caches in multi-core systems, ensuring that all cores access the latest information for shared pages. Experimental results demonstrate that the proposed PTE-page caching can reduce address translation overhead by up to 33.3% compared to traditional OS-managed tagless DRAM caches, improving overall program execution time by an average of 10.5%. This effectively mitigates bottlenecks caused by address translation.
14:20 CEST TS11.5 EXPLORING THE SPARSITY-QUANTIZATION INTERPLAY ON A NOVEL HYBRID SNN EVENT-DRIVEN ARCHITECTURE
Speaker:
Tosiron Adegbija, University of Arizona, US
Authors:
Ilkin Aliyev, Jesus Lopez and Tosiron Adegbija, University of Arizona, US
Abstract
Spiking Neural Networks (SNNs) offer potential advantages in energy efficiency but currently trail Artificial Neural Networks (ANNs) in versatility, largely due to challenges in efficient input encoding. Recent work shows that direct coding achieves superior accuracy with fewer timesteps than traditional rate coding. However, there is a lack of specialized hardware to fully exploit the potential of direct-coded SNNs, especially their mix of dense and sparse layers. This work proposes the first hybrid inference architecture for direct-coded SNNs. The proposed hardware architecture comprises a dense core to efficiently process the input layer and sparse cores optimized for event-driven spiking convolutions. Furthermore, for the first time, we investigate and quantify the quantization effect on sparsity. Our experiments on two variations of the VGG9 network and implemented on a Xilinx Virtex UltraScale+ FPGA (Field-Programmable Gate Array) reveal two novel findings. Firstly, quantization increases the network sparsity by up to 15.2% with minimal loss of accuracy. Combined with the inherent low power benefits, this leads to a 3.4x improvement in energy compared to the full-precision version. Secondly, direct coding outperforms rate coding, achieving a 10% improvement in accuracy and consuming 26.4x less energy per image. Overall, our accelerator achieves ~51x higher throughput and consumes half the power compared to previous work. Our accelerator code is available at: https://github.com/githubofaliyev/SNN-DSE/tree/DATE25.
14:25 CEST TS11.6 SWIFT-SIM: A MODULAR AND HYBRID GPU ARCHITECTURE SIMULATION FRAMEWORK
Speaker:
Xiangrong Xu, Beihang University, CN
Authors:
Xiangrong Xu, Yuanqiu Lv, Liang Wang, Limin Xiao, Meng Han, Runnan Shen and Jinquan Wang, Beihang University, CN
Abstract
Simulation tools are critical for architects to quickly estimate the impact of aggressive new features of GPU architecture. Existing cycle-accurate GPU simulators are typically cumbersome and slow to run. We observe that it is time-consuming and unnecessary for cycle-accurate GPU simulators to perform detailed simulations for the entire GPU when exploring the design space of specific components. This paper proposes Swift-Sim, a modular and hybrid GPU simulation framework. With a highly modular design, our framework can choose appropriate modeling approaches for each component according to requirements. For components of interest to architects, we use cycle-accurate simulation to evaluate new GPU architectures. For other components, we use analytical modeling, which accelerates simulation speed with only minor and acceptable degradation in overall accuracy. Based on this simulation framework, we present two working examples of hybrid modeling that simulate the ALU pipeline and memory accesses using analytical models. We further implement two GPU performance simulators with different levels of simplification based on Swift-Sim and evaluate them using configurations from real GPUs. The results show that the two simulators achieve an 82.6x and 211.2x geometric mean speedup compared to Accel-Sim with insignificant accuracy degradation.
14:30 CEST TS11.7 HYMM: A HYBRID SPARSE-DENSE MATRIX MULTIPLICATION ACCELERATOR FOR GCNS
Speaker:
Hunjong Lee, Korea University, KR
Authors:
Hunjong Lee1, Jihun Lee1, Jaewon Seo1, Yunho Oh1, Myungkuk Yoon2 and Gunjae Koo1
1Korea University, KR; 2Ewha Womans University, KR
Abstract
Graph convolutional networks (GCNs) are emerging neural network models designed to process graph-structured data. Due to massively parallel computations using irregular data structures by GCNs, traditional processors such as CPUs, GPUs, and TPUs exhibit significant inefficiency when performing GCN inferences. Even though researchers have proposed several GCN accelerators, the prior dataflow architectures struggle with inefficient data utilization due to the divergent and irregularly structured graph data. In order to overcome such performance hurdles, we propose a hybrid dataflow architecture for sparse-dense matrix multiplications (SpDeMMs), called HyMM. HyMM employs disparate dataflow architectures using different data formats to achieve more efficient data reuse across varying degree levels within graph structures, hence HyMM can reduce off-chip memory accesses significantly. We implement a cycle-accurate simulator to evaluate the performance of HyMM. Our evaluation results demonstrate that HyMM can achieve up to 4.78x performance uplift by reducing off-chip memory accesses by 91% compared to the conventional non-hybrid dataflow.
14:35 CEST TS11.8 BUDDY ECC: MAKING CACHE MOSTLY CLEAN IN CXL-BASED MEMORY SYSTEMS FOR ENHANCED ERROR CORRECTION AT LOW COST
Speaker:
Yongho Lee, Sungkyunkwan University, KR
Authors:
Yongho Lee, Junbum Park, Osang Kwon, Sungbin Jang and Seokin Hong, Sungkyunkwan University, KR
Abstract
As Compute Express Link (CXL) emerges as a key memory interconnect, interest in optimization opportunities and challenges has grown. However, due to the different characteristics of the CXL Memory Module (CMM) compared to traditional DRAM-based Dual In-line Memory Modules (DIMMs), existing optimizations may not be effectively applied. In this paper, we propose a Proactive Write-back Policy that leverages the full-duplex nature and features of the CMM to optimize bandwidth, enhance reliability, and reduce area overhead. First, the Proactive Write-back policy improves bandwidth efficiency by minimizing dirty cachelines in the last-level cache through dead block prediction, proactively identifying and writing back cachelines that are unlikely to be rewritten. Second, the Utilization-aware Policy dynamically monitors the internal bandwidth of the CMM, sending write-back requests only when the module is under a low load, thus preventing performance degradation during high traffic. Finally, the robust Buddy ECC scheme enhances data reliability by separating Error Detection Code (EDC) for clean cachelines and stronger Error Correction Code (ECC) for dirty cachelines. Buddy ECC improved bandwidth utilization by 46%, limited performance degradation to 0.33%, and kept the energy consumption increase under 1%.
14:40 CEST TS11.9 A PERFORMANCE ANALYSIS OF CHIPLET-BASED SYSTEMS
Speaker:
Neethu Bal Mallya, Department of Computer Science and Engineering, Chalmers University of Technology, Sweden, SE
Authors:
Neethu Bal Mallya, Panagiotis Strikos, Bhavishya Goel, Ahsen Ejaz and Ioannis Sourdis, Chalmers University of Technology, SE
Abstract
As the semiconductor industry struggles to keep Moore's law alive and integrate more functionality on a chip, multi-chiplet chips offer a lower cost alternative to large monolithic chips due to their higher yield. However, chiplet-based chips are naturally Non-Uniform Memory Access (NUMA) systems and therefore suffer from slow remote accesses. NUMA overheads are exacerbated by the limited throughput and higher latency of inter-chiplet communication. This paper offers a comprehensive analysis of chiplet-based systems with different design parameters measuring their performance overheads compared to traditional monolithic multicore designs and their scalability to system and chiplet size. Several design alternatives pertaining to the memory hierarchy, interconnects, and technology aspects are studied. Our analysis shows that although chiplet-based chips can cut (recurring engineering) costs to half, they may give away over a third of the monolithic performance. Part of this performance overhead can be regained with specific design choices.
14:45 CEST TS11.10 A HIGH-PERFORMANCE AND FLEXIBLE ACCELERATOR FOR DYNAMIC GRAPH CONVOLUTIONAL NETWORKS
Speaker:
Ke Wang, University of North Carolina at Charlotte, US
Authors:
Yingnan Zhao1, Ke Wang2 and Ahmed Louri1
1The George Washington University, US; 2University of North Carolina at Charlotte, US
Abstract
Dynamic Graph Convolutional Networks (DGCNs) have been applied to various dynamic graph-related applications, such as social networks, to achieve high inference accuracy. Typically, each DGCN layer consists of two distinct modules: a Graph Convolutional Network (GCN) module that captures spatial information, and a Recurrent Neural Network (RNN) module that extracts temporal information from input dynamic graphs. The different functionalities of these modules pose significant challenges for hardware platforms, particularly in achieving high-performance and energy-efficient inference processing. To this end, this paper introduces HiFlex, a high-performance and flexible accelerator designed for DGCN inference. At the architecture level, HiFlex implements multiple homogeneous processing elements (PEs) to perform main computations for GCN and RNN modules, along with a versatile interconnection fabric to optimize data communication and enhance on-chip data reuse efficiency. The flexible interconnection fabric can be dynamically configured to provide various on-chip topologies, supporting point-to-point and multicast communication patterns needed for GCN and RNN processing. At the algorithm level, HiFlex introduces a dynamic control policy that partitions, allocates, and configures hardware resources for distinct modules based on their computational requirements. Evaluation results using real-world dynamic graphs demonstrate that HiFlex achieves, on average, a 38% reduction in execution time and a 42% decrease in energy consumption for DGCN inference, compared to state-of-the-art approaches such as ES-DGCN, ReaDy, and RACE.
14:50 CEST TS11.11 AMPHI: PRACTICAL AND INTELLIGENT DATA PREFETCHING FOR THE FIRST-LEVEL CACHE
Speaker:
Zicong Wang, College of Computer Science and Technology, National University of Defense Technology, CN
Authors:
Xuan Tang, Zicong Wang, Shuiyi He, Dezun Dong and Xiangke Liao, National University of Defense Technology, CN
Abstract
Data prefetchers play a crucial role in alleviating the memory wall by predicting future memory accesses. First-level cache prefetchers can observe all memory instructions but often rely on simpler strategies due to limited resources. While emerging machine learning-based approaches cover more memory access patterns, they typically require higher computational and storage resources and are usually deployed in the last-level cache. Other intelligent solutions for the first-level cache show only modest performance gains. To address this, we propose Amphi, the first practical and intelligent data prefetcher specifically designed for the first-level cache. Applying a binarized temporal convolutional network, Amphi significantly reduces storage overhead while maintaining performance comparable to the SOTA intelligent prefetcher. With a storage overhead of only 3.4 KB, Amphi requires only one-eighth of Pythia's storage needs. Amphi paves the way for the broader adoption of intelligence-driven prefetching solutions.

TS12 Smart and Autonomous Systems for a Smart World

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS12.1 SACK: ENABLING ENVIRONMENTAL SITUATION-AWARE ACCESS CONTROL FOR AUTONOMOUS VEHICLES IN LINUX KERNEL
Speaker:
Boyan Chen, Peking University, CN
Authors:
Boyan Chen1, Qingni Shen1, Lei Xue2, Jiarui She1, Xiaolei Zhang1, Xiapu Luo3, Xin Zhang1, Wei Chen1 and Zhonghai Wu1
1Peking University, CN; 2Sun Yat-Sen University, CN; 3The Hong Kong Polytechnic University, HK
Abstract
Connected and autonomous vehicles (CAVs) operate in open and evolving environments, which require timely and adaptive permission restriction to address dynamic risks that arise from changes in environmental situations (hereinafter referred to as situations), such as emergency situations due to vehicle crashes. Enforcing situation-aware access control is an effective approach to support adaptive permission restriction. Current works mainly implement situation-aware access control in the permission framework and API monitoring in user space. They are vulnerable to being bypassed and are coarse-grained. Autonomous systems have widely adopted mandatory access control (MAC) to configure and enforce system-wide and fine-grained access control policies. However, the MAC supported by Linux security modules (LSM) relies on pre-defined security contexts (e.g., type) and relatively fixed permission transition conditions (e.g., exec syscall), which lacks consideration of environmental factors. To address these issues, we propose a Situation-aware Access Control framework in the Kernel (SACK), which enforces adaptive permission restriction based on environmental factors for CAVs. Incorporating environmental situations into the LSM framework is not straightforward. SACK introduces situation states as a new security context for abstracting environmental factors in the kernel. Subsequently, SACK utilizes a situation state machine to implement new adaptive permission transitions triggered by situation events. In addition, SACK provides a novel situation-aware policy language that links specific user space permissions to MAC rules while maintaining compatibility with other LSMs such as AppArmor. We develop two prototypes: an independent SACK with its own policies and a SACK-enhanced AppArmor that adaptively updates the corresponding policies of AppArmor. The experimental results demonstrate that SACK can efficiently enforce situation-adaptive permissions with negligible runtime overhead.
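As a highly simplified illustration of the situation state machine idea described above (a user-space toy, not the kernel LSM implementation in the paper), the sketch below maps situation events to states and states to permitted operations; the concrete states, events, and permission sets are made-up examples.

```python
# Situation-aware permission restriction sketch (user-space toy, not an LSM).
# States, events, and permission sets below are hypothetical examples; the
# paper implements this mechanism inside the Linux kernel as a security module.
TRANSITIONS = {
    ("normal", "crash_detected"): "emergency",
    ("normal", "geofence_exit"): "restricted",
    ("emergency", "crash_cleared"): "normal",
    ("restricted", "geofence_enter"): "normal",
}
PERMISSIONS = {
    "normal": {"read_sensors", "actuate", "upload_logs"},
    "restricted": {"read_sensors", "upload_logs"},
    "emergency": {"read_sensors"},   # only minimal access while in emergency
}

class SituationStateMachine:
    def __init__(self, state: str = "normal"):
        self.state = state

    def on_event(self, event: str) -> str:
        # Unknown (state, event) pairs leave the state unchanged.
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state

    def allowed(self, operation: str) -> bool:
        return operation in PERMISSIONS[self.state]

ssm = SituationStateMachine()
ssm.on_event("crash_detected")
assert not ssm.allowed("actuate") and ssm.allowed("read_sensors")
```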
14:05 CEST TS12.2 EXPLOITING SYSML V2 MODELING FOR AUTOMATIC SMART FACTORIES CONFIGURATION
Speaker:
Mario Libro, Università di Verona, IT
Authors:
Mario Libro1, Sebastiano Gaiardelli1, Marco Panato2, Stefano Spellini2, Michele Lora1 and Franco Fummi1
1Università di Verona, IT; 2Factoryal S.r.l., IT
Abstract
Smart factories are complex environments equipped with both production machinery and computing devices that collect, share, and analyze data. For this reason, the modeling of today's factories can no longer rely on traditional methods, and computer engineering tools, such as SysML, must be employed. At the same time, the current SysML v1.* standard does not provide the rigor required to model the complexity and the criticalities of a smart factory. Recently, SysML v2 has been proposed and is about to be released as the new version of the standard. Its release candidate shows that the new version aims to provide a more rigorous and complete modeling language, able to fulfill the requirements of the smart factory domain. In this paper, we explore the capabilities of the new SysML v2 standard by building a rigorous modeling strategy, able to capture the aspects of a smart factory related to the production process, the computation, and the communication. We apply the proposed strategy to model a fully-fledged smart factory, and we rely on the models to automatically configure the different pieces of equipment and software components in the factory.
14:10 CEST TS12.3 HIDP: HIERARCHICAL DNN PARTITIONING FOR DISTRIBUTED INFERENCE ON HETEROGENEOUS EDGE PLATFORMS
Speaker:
Zain Taufique, University of Turku, FI
Authors:
Zain Taufique1, Aman Vyas1, Antonio Miele2, Pasi Liljeberg1 and Anil Kanduri1
1University of Turku, FI; 2Politecnico di Milano, IT
Abstract
Edge inference techniques partition and distribute Deep Neural Network (DNN) inference tasks among multiple edge nodes for low latency inference, without considering the core-level heterogeneity of edge nodes. Further, default DNN inference frameworks also do not fully utilize the resources of heterogeneous edge nodes, resulting in higher inference latency. In this work, we propose a hierarchical DNN partitioning strategy (HiDP) for distributed inference on heterogeneous edge nodes. Our strategy hierarchically partitions DNN workloads at both global and local levels by considering the core-level heterogeneity of edge nodes. We evaluated our proposed HiDP strategy against relevant distributed inference techniques over widely used DNN models on commercial edge devices. On average our strategy achieved 38% lower latency, 46% lower energy, and 56% higher throughput in comparison with other relevant approaches.
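A bare-bones illustration of hierarchical partitioning in the spirit described above: first split the layer workload across edge nodes in proportion to node-level throughput, then split each node's share across its heterogeneous core clusters. The throughput figures and the proportional-split rule are assumptions for illustration, not the paper's policy.

```python
# Two-level proportional workload split (illustrative, not the HiDP policy).
# Capabilities are hypothetical relative throughput figures.
def proportional_split(total: float, capabilities: dict[str, float]) -> dict[str, float]:
    cap_sum = sum(capabilities.values())
    return {name: total * cap / cap_sum for name, cap in capabilities.items()}

nodes = {"node_A": 4.0, "node_B": 2.0, "node_C": 1.0}          # global level
cores_per_node = {
    "node_A": {"big_cores": 3.0, "little_cores": 1.0},          # local level
    "node_B": {"big_cores": 1.0, "little_cores": 1.0},
    "node_C": {"little_cores": 1.0},
}

workload_layers = 70.0  # e.g., DNN layers (or FLOPs) to distribute
global_share = proportional_split(workload_layers, nodes)
local_share = {n: proportional_split(share, cores_per_node[n])
               for n, share in global_share.items()}
print(global_share)
print(local_share)
```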
14:15 CEST TS12.4 COUPLING NEURAL NETWORKS AND PHYSICS EQUATIONS FOR LI-ION BATTERY STATE-OF-CHARGE PREDICTION
Speaker:
Giovanni Pollo, Politecnico di Torino, IT
Authors:
Giovanni Pollo1, Alessio Burrello2, Enrico Macii1, Massimo Poncino1, Sara Vinco1 and Daniele Jahier Pagliari1
1Politecnico di Torino, IT; 2Politecnico di Torino | Università di Bologna, IT
Abstract
Estimating the evolution of the battery's State of Charge (SoC) in response to its usage is critical for implementing effective power management policies and for ultimately improving the system's lifetime. Most existing estimation methods are either physics-based digital twins of the battery or data-driven models such as Neural Networks (NNs). In this work, we propose two new contributions in this domain. First, we introduce a novel NN architecture formed by two cascaded branches: one to predict the current SoC based on sensor readings, and one to estimate the SoC at a future time as a function of the load behavior. Second, we integrate battery dynamics equations into the training of our NN, merging the physics-based and data-driven approaches, to improve the models' generalization over variable prediction horizons. We validate our approach on two publicly accessible datasets, showing that our Physics-Informed Neural Networks (PINNs) outperform purely data-driven ones while also obtaining superior prediction accuracy with a smaller architecture with respect to the state-of-the-art.
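As a rough illustration of the physics-informed idea described above (not the authors' two-branch architecture), the sketch below combines a data-driven SoC regression loss with a Coulomb-counting residual. The toy MLP, the use of PyTorch, and the simple update SoC(t+dt) = SoC(t) - I*dt/Q are assumptions made for illustration only.

```python
# Minimal physics-informed loss sketch for SoC prediction (illustrative only).
# Assumptions: a toy MLP, PyTorch, and simple Coulomb-counting dynamics
#   soc(t + dt) ~= soc(t) - i * dt / q_nominal
# The paper integrates its own battery dynamics equations and network design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SocNet(nn.Module):
    def __init__(self, in_features: int = 3, hidden: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # SoC in [0, 1]
        )

    def forward(self, x):  # x = [voltage, current, temperature]
        return self.body(x)

def pinn_loss(model, x_t, soc_t, x_next, soc_next, current, dt, q_nominal, lam=0.1):
    """Data loss on measured SoC plus a physics residual on the SoC update."""
    pred_t = model(x_t)
    pred_next = model(x_next)
    data_loss = F.mse_loss(pred_t, soc_t) + F.mse_loss(pred_next, soc_next)
    # Physics residual: the predicted future SoC should follow Coulomb counting.
    physics_target = pred_t - current * dt / q_nominal
    physics_loss = F.mse_loss(pred_next, physics_target)
    return data_loss + lam * physics_loss

# Toy usage with random data (shapes only; not a real battery trace).
model = SocNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_t, x_next = torch.rand(64, 3), torch.rand(64, 3)
soc_t, soc_next = torch.rand(64, 1), torch.rand(64, 1)
current, dt, q = torch.rand(64, 1), 1.0, 3600.0  # amps, seconds, amp-seconds (toy)
loss = pinn_loss(model, x_t, soc_t, x_next, soc_next, current, dt, q)
loss.backward()
opt.step()
```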
14:20 CEST TS12.5 AUTONOMOUS UAV-ASSISTED IOT SYSTEMS WITH DEEP REINFORCEMENT LEARNING BASED DATA FERRY
Speaker:
Mason Conkel, The University of Texas at San Antonio, US
Authors:
Mason Conkel1, Wen Zhang2, Mimi Xie1, Yufang Jin1 and Chen Pan1
1The University of Texas at San Antonio, US; 2Wright State University, US
Abstract
Emerging unmanned aerial vehicle (UAV) technology offers reliable, flexible, and controllable techniques for transferring data collected by wireless internet of things (IoT) devices located in remote areas. However, deploying UAVs faces limitations in mission distance between recharges, especially when recharging occurs far from the monitoring area. To address these challenges, we propose smart charging stations installed within the monitoring area and equipped with energy-harvesting features and communication modules. These stations can replenish the UAV's energy and act as cluster heads by collecting information from IoT devices within their jurisdiction. This allows a UAV to operate continuously by downloading while charging and forwarding the data to the remote server during flight. Despite these improvements, the unpredictable nature of energy-harvesting devices and charging needs can lead to stale or obsolete information at cluster heads. The limited communication range may prevent the cluster heads from establishing connections with all nodes in their jurisdiction. To overcome these issues, we propose an age-of-information-aware data ferry algorithm using deep reinforcement learning to determine the UAV's flight path. The deep reinforcement learning agent, running on cluster heads, utilizes a global state gathered by the UAV to output the location of the next stop, which can be a cluster head or an IoT device. The experiments show that the algorithm can minimize the age of information without diminishing data collection.
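To make the age-of-information objective concrete, the following sketch maintains per-node AoI counters and greedily picks the next stop with the highest age; this is a simplified heuristic standing in for the paper's deep-reinforcement-learning policy, and all node names and numbers are assumptions.

```python
# Greedy age-of-information (AoI) next-stop selection (illustrative only).
# The paper trains a DRL agent for this decision; a simple heuristic stands in
# here so the objective is easy to see. All values are hypothetical.
ages = {"cluster_head_A": 12.0, "cluster_head_B": 3.0, "iot_node_7": 25.0}

def step(ages: dict[str, float], dt: float) -> str:
    """Advance every node's AoI by dt, visit the stalest node, reset its AoI."""
    for node in ages:
        ages[node] += dt
    next_stop = max(ages, key=ages.get)
    ages[next_stop] = 0.0   # data collected, information is fresh again
    return next_stop

for _ in range(3):
    print(step(ages, dt=1.0), ages)
```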
14:25 CEST TS12.6 AERODIFFUSION: COMPLEX AERIAL IMAGE SYNTHESIS WITH DYNAMIC TEXT DESCRIPTIONS AND FEATURE-AUGMENTED DIFFUSION MODELS
Speaker:
Douglas Townsell, Wright State University, US
Authors:
Douglas Townsell1, Mimi Xie2, Bin Wang1, Fathi Amsaad1, Varshitha Thanam3 and Wen Zhang1
1Wright State University, US; 2The University of Texas at San Antonio, US; 3Wright State University, US
Abstract
Aerial imagery provides crucial insights for various fields, including remote monitoring, environmental assessment, and autonomous navigation. However, the availability of aerial image datasets is limited due to privacy concerns and imbalanced data distribution, impeding the development of robust deep learning models. Recent advancements in text-guided image synthesis offer a promising approach to enrich and diversify these datasets. Despite progress, existing generative models face challenges in synthesizing realistic aerial images due to the lack of paired text-aerial datasets, the complexity of densely packed objects, and the limitations of modeling object relationships. In this paper, we introduce AeroDiffusion, a novel framework designed to overcome these challenges by leveraging large language models (LLMs) for keypoint-aware text description generation and a feature-augmented diffusion process for realistic image synthesis. Our approach integrates region-level feature extraction to preserve small objects and multi-modal feature alignment to improve textual descriptions of complex aerial scenes. AeroDiffusion is the first to extend deep generative models for high-resolution, text-guided aerial image generation, including the creation of images from novel viewpoints. We contribute a new paired text-aerial image dataset and demonstrate the effectiveness of our model, achieving an FID score of 78.15 across five benchmarks, significantly outperforming state-of-the-art models such as DDPM (217.95), Stable Diffusion (119.13), and ARLDM (111.59).
14:30 CEST TS12.7 POWER- AND DEADLINE-AWARE DYNAMIC INFERENCE ON INTERMITTENT COMPUTING SYSTEMS
Speaker:
Hengrui Zhao, University of Southampton, GB
Authors:
Hengrui Zhao, Lei Xun, Jagmohan Chauhan and Geoff Merrett, University of Southampton, GB
Abstract
In energy-harvesting intermittent computing systems, balancing power constraints with the need for timely and accurate inference remains a critical challenge. Existing methods often sacrifice significant accuracy or fail to adapt effectively to fluctuating power conditions. This paper presents DualAdaptNet, a power- and deadline-aware neural network architecture that dynamically adapts both its width and depth to ensure reliable inference under variable power conditions. Additionally, a runtime scheduling method is introduced to select an appropriate sub-network configuration based on real-time energy-harvesting conditions and system deadlines. Experimental results on the MNIST dataset demonstrate that our approach completes up to 7.0% more inference tasks within a specified deadline while also improving average accuracy by 15.4% compared to the state-of-the-art.
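A minimal sketch of the kind of runtime selection described above: given a table of sub-network configurations with profiled latency, energy, and accuracy (all numbers hypothetical), pick the most accurate configuration that fits the current energy budget and deadline. The actual scheduler and profiling data in the paper differ.

```python
# Deadline- and power-aware sub-network selection (illustrative sketch).
# The configurations and numbers below are hypothetical, not from the paper.
from dataclasses import dataclass

@dataclass
class SubNet:
    name: str
    latency_ms: float   # profiled inference time
    energy_uj: float    # profiled energy per inference
    accuracy: float     # validation accuracy

CONFIGS = [
    SubNet("full-width/full-depth", 12.0, 900.0, 0.97),
    SubNet("half-width/full-depth", 7.0, 520.0, 0.95),
    SubNet("half-width/half-depth", 3.5, 260.0, 0.91),
    SubNet("quarter-width/half-depth", 2.0, 140.0, 0.86),
]

def select_subnet(deadline_ms: float, energy_budget_uj: float) -> SubNet | None:
    """Return the most accurate configuration that meets both constraints."""
    feasible = [c for c in CONFIGS
                if c.latency_ms <= deadline_ms and c.energy_uj <= energy_budget_uj]
    return max(feasible, key=lambda c: c.accuracy) if feasible else None

# Example: plenty of stored energy but a tight deadline.
print(select_subnet(deadline_ms=4.0, energy_budget_uj=600.0))
```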
14:35 CEST TS12.8 DCHA: DISTRIBUTED-CENTRALIZED HETEROGENEOUS ARCHITECTURE ENABLES EFFICIENT MULTI-TASK PROCESSING FOR SMART SENSING
Speaker:
Cheng Qu, Beijing University of Posts and Telecommunications, CN
Authors:
Erxiang Ren1, Cheng Qu2, Li Luo1, Yonghua Li2, Zheyu Liu3, Xinghua Yang4, Qi Wei5 and Fei Qiao5
1Beijing Jiaotong University, CN; 2Beijing University of Posts and Telecommunications, CN; 3MakeSens AI, CN; 4Beijing Forestry University, CN; 5Tsinghua University, CN
Abstract
The rapid development of artificial intelligence (AI) has accelerated the progression of IoT technology into the smart era. Integrating AI processing capabilities into IoT devices to create smart sensing systems holds significant promise. In this work, we propose a distributed-centralized heterogeneous architecture that enables efficient multi-task processing for smart sensing. This architecture improves the operational efficiency of sensing systems and enhances the deployment scalability through collaborative computing across end, edge, and center nodes. Specifically, we partition the network in traditional centralized sensing systems into several parts and perform algorithm-hardware co-design for each part on its respective deployment platform. We developed a sample design to validate the proposed architecture. By implementing a lightweight image encoder, we achieved an 88x reduction in encoder parameters and up to 9873x energy gain, facilitating deployment on resource-constrained devices. Experimental results demonstrate that the proposed architecture effectively reduces overall energy consumption by 0.0573x to 0.0889x, while maintaining robust multi-task inference capabilities. Moreover, energy consumption reductions of 2.88x to 3.22x on edge nodes and 6311.56x to 10037.23x on end nodes were observed.
14:40 CEST TS12.9 FAIRXBAR: IMPROVING THE FAIRNESS OF DEEP NEURAL NETWORKS WITH NON-IDEAL IN-MEMORY COMPUTING HARDWARE
Speaker:
Cheng Wang, Iowa State University of Science and Technology, US
Authors:
Sohan Salahuddin Mugdho1, Yuanbo Guo2, Ethan Rogers1, Weiwei Zhao1, Yiyu Shi2 and Cheng Wang1
1Iowa State University of Science and Technology, US; 2University of Notre Dame, US
Abstract
While artificial intelligence (AI) based on deep neural networks (DNN) has achieved near-human performance in various cognitive tasks, such data-driven models are known to exhibit implicit bias against specific subgroups, leading to fairness issues. Most existing methods for improving model fairness only consider software-based optimizations, while the impact of hardware is largely unexplored. In this work, we investigate the impact of underlying hardware technology on AI fairness as we deploy DNN-based medical diagnosis algorithms onto in-memory computing hardware accelerators. Based on our newly developed framework that characterizes the importance of DNN weight parameters to fairness, we demonstrate that device variability-induced non-idealities such as stuck-at faults and noises due to variation can be exploited to deliver improved fairness (up to 32% improvement) with significantly reduced trade-off (less than 1% loss) of the overall accuracy. We additionally develop a hardware non-idealities-aware training methodology that further mitigates the bias between unprivileged and privileged demographic groups in our experiments on skin lesion diagnosis datasets. Our work suggests exciting opportunities for leveraging the hardware attributes in a cross-layer co-design to enable equitable and fair AI.
14:45 CEST TS12.10 HUMAN-CENTERED DIGITAL TWIN FOR INDUSTRY 5.0
Speaker:
Francesco Biondani, Università di Verona, IT
Authors:
Francesco Biondani1, Luigi Capogrosso1, Nicola Dall'Ora1, Enrico Fraccaroli2, Marco Cristani1 and Franco Fummi1
1Università di Verona, IT; 2University of North Carolina at Chapel Hill, US
Abstract
Moving beyond the automation-driven paradigm of Industry 4.0, Industry 5.0 emphasizes human-centric industrial systems where human creativity and instincts complement precise and advanced machines. With this new paradigm, there is a growing need for resource-efficient and user-preferred manufacturing solutions that integrate humans into industrial processes. Unfortunately, methodologies for incorporating human elements into industrial processes remain underdeveloped. In this work, we present the first pipeline for the creation of a human-centered Digital Twin (DT), leveraging Unreal Engine's MetaHuman technology to track worker alertness in real-time. Our findings demonstrate the potential of integrating Artificial Intelligence (AI) and human-centered design within Industry 5.0 to enhance both worker safety and industrial efficiency.
14:46 CEST TS12.11 ENERGY-AWARE ERROR CORRECTION METHOD FOR INDOOR POSITIONING AND TRACKING
Speaker:
Donkyu Baek, Chungbuk National University, KR
Authors:
Donguk Kim1, Yukai Chen2, Donkyu Baek1, Enrico Macii3 and Massimo Poncino3
1Chungbuk National University, KR; 2IMEC, BE; 3Politecnico di Torino, IT
Abstract
Indoor positioning is crucial for the effective use of drones in smart environments, enabling precise navigation and control in complex indoor spaces where GPS signals are weak or unavailable and wireless communication-based systems must be used. In order to improve positioning accuracy, various distance measurement techniques and related error correction methods have been proposed in the literature. However, these methods are mostly focused on accuracy and often require a significant amount of computational resources, which is quite inefficient when deployed on battery-operated devices like small robots or drones because of their limited battery capacity. Moreover, conventional error correction methods are not very effective for tracking moving objects. In this paper, we first analyze the trade-off between energy consumption and accuracy of error correction and identify the most energy-efficient error correction method. Based on this analysis in the accuracy/energy space, we introduce a new energy-efficient error correction method that is especially targeted at tracking a moving object. We validated our solution by implementing an Ultra-Wideband-based indoor positioning system and demonstrated that the proposed method improves positioning accuracy by 15% and reduces energy consumption by 33% compared to the state-of-the-art method.
14:47 CEST TS12.12 DECENTRALIZING IOT DATA PROCESSING: THE RISE OF BLOCKCHAIN-BASED SOLUTIONS
Speaker:
Daniela De Venuto, Polytechnic University of Bari, IT
Authors:
Giuseppe Spadavecchia1, Marco Fiore2, Marina Mongiello2 and Daniela De Venuto2
1Private, IT; 2Polytechnic University of Bari, IT
Abstract
The rise of the Internet of Things has introduced new challenges related to data security and transparency, especially in industries like agri-food where traceability is critical. Traditional cloud-based solutions, while scalable, pose security and privacy risks. This paper proposes a decentralized architecture using Blockchain technology to address these challenges. We deploy IoT sensors connected to a Raspberry Pi for edge processing and utilize Hyperledger Fabric, a private Blockchain, to manage and store data securely. Two approaches were evaluated: computation of a Discomfort Index on the Raspberry Pi (edge processing) versus performing the same computation on-chain using smart contracts. Performance metrics, including latency, throughput, and error rate, were measured using Hyperledger Caliper. The results show that edge processing offers superior performance in terms of latency and throughput, while Blockchain-based computation ensures greater transparency and trust. This study highlights the potential of Blockchain as a viable alternative to centralized cloud systems in IoT environments and suggests future research in scalability, hybrid architectures, and energy efficiency.
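For context, a common way to compute a discomfort index from temperature and relative humidity is Thom's formula; the sketch below uses it as a stand-in for the edge-side computation described above. The choice of Thom's index is an assumption, since the paper does not state which variant it computes.

```python
# Edge-side discomfort-index computation, as it might run on the Raspberry Pi.
# Assumption: Thom's discomfort index; the paper's exact formula is not given.
def discomfort_index(temp_c: float, rel_humidity: float) -> float:
    """Thom's DI from air temperature (deg C) and relative humidity (%)."""
    return temp_c - 0.55 * (1.0 - 0.01 * rel_humidity) * (temp_c - 14.5)

# Example: 30 degC at 70% RH gives DI of roughly 27.4 (uncomfortable range).
if __name__ == "__main__":
    print(round(discomfort_index(30.0, 70.0), 1))
```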
14:48 CEST TS12.13 ENABLING A PORTABLE BRAIN COMPUTER INTERFACE FOR REHABILITATION OF SPINAL CORD INJURIES
Speaker:
Adrian Evans, CEA, FR
Authors:
Adrian Evans1, Victor Roux-Sibillon2, Joe Saad2, Ivan Miro-Panades2, Tetiana Aksenova3 and Lorena Anghel4
1CEA, FR; 2CEA-List, FR; 3CEA-Leti, FR; 4Grenoble-Alpes University, Grenoble, France, FR
Abstract
In clinical trials, brain signal decoders combined with spinal stimulation have been shown to be a promising means to restore mobility to paraplegic and tetraplegic patients. To make this technology available for home use, the complex brain signal decoding must be performed using a low-power, portable, battery-operated system. This case study shows how the decoding algorithm for a Brain-Computer Interface (BCI) system was ported to an embedded platform, resulting in an over 25× power reduction compared to the previous implementation, while respecting real-time and accuracy constraints.

W05 OSSMPIC - Open Source Solutions for Massively Parallel Integrated Circuits

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 18:00 CEST


W06 Cross-stack Explorations of Ferroelectric-based Logic and Memory Solutions for At-Scale Compute Workloads

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 18:00 CEST


W08 ASD Workshop “How to supervise autonomy?”

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 14:00 CEST - 18:00 CEST

Autonomous systems are on their way from an exotic system species to a mainstream technology, where they even reach safety-critical and high-assurance applications. Yet, efficient design concepts providing the required behavioral guarantees while keeping the benefits of autonomous intelligence are still an open topic, in theory and even more so in engineering practice. Some approaches rely on centralized guidance via infrastructure, while others extend individual component capabilities by protective, often model-based, functions. Another differentiation is the use of human support in unclear situations, such as in level 4 vehicle automation, vs. independent management with function degradation (e.g., safety layers). The workshop plans to provide examples from very different areas, such as road traffic, UAVs, human-assistive robotics, and facility management. The design concepts have a high societal and economic relevance, including legal aspects such as certification and liability.


FS02 Focus Session - AI-Driven Design Evolution: Benchmarking and Infrastructure for the Next Era of Semiconductors and Photonics

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 16:30 CEST - 18:00 CEST

Session chair:
Anthony Agnesina, NVIDIA Corp., US

Session co-chair:
Hao Geng, Shanghai Tech University, CN

Organisers:
Haoyu Yang, NVIDIA Corp., US
Yuzhe Ma, Hong Kong University of Science and Technology (GZ), CN

"As AI and machine learning models become increasingly integrated into semiconductor and photonic design workflows, the need for rigorous benchmarking, robust datasets, and scalable infrastructures is paramount. This special session presents pioneering research on evaluating AI capabilities across digital hardware, formal verification, and photonic device design, with a strong focus on the importance of benchmark frameworks and dataset development. The session will feature four key talks: 1) ChipVQA is a benchmark designed to evaluate visual language models (VLMs) in chip design, requiring a visual understanding of diagrams and schematics across five disciplines. Current models, including GPT-4o, struggle with domain-specific tasks, while a novel agent-based approach shows potential for improved performance. 2) FVEval is a comprehensive benchmark designed to evaluate large language models (LLMs) in formal verification tasks for digital chip design. It assesses LLMs' abilities to generate SystemVerilog assertions and reason about design RTL. The benchmark includes both expert-written and synthetic examples, offering insights into current LLM capabilities and potential for improving formal verification productivity. 3) MAPS introduces an open-source infrastructure to standardize AI-based solvers for photonic device simulation and inverse design. It provides a rich dataset, a neural operator model zoo for training, and a scalable framework for benchmarking AI-based photonic simulators. MAPS aims to accelerate innovation in photonic hardware by bridging the gap between AI-driven physics simulations and photonic design optimization. 4) PICEval introduces a benchmark to evaluate large language models (LLMs) for automating the design of photonic integrated circuits (PICs). The benchmark spans device- to circuit-level designs and assesses the functionality and fidelity of LLM-generated netlists by comparing them to expert-written solutions. It highlights the challenges and potential of LLMs in automating PIC design and identifies areas for further research to optimize their application. Together, these talks underscore the crucial role of benchmarks, datasets, and scalable infrastructure in advancing AI for chip and photonic design, shaping the future of automated and intelligent design workflows."

Time Label Presentation Title
Authors
16:30 CEST FS02.1 CHIPVQA: BENCHMARKING VISUAL LANGUAGE MODELS FOR CHIP DESIGN
Speaker:
Haoyu Yang, NVIDIA Corp., US
Authors:
Haoyu Yang, Qijing Huang, Nathaniel Pinckney, Walker Turner, Wenfei Zhou, Yanqing Zhang, Chia-Tung Ho, Chen-Chia Chang and Haoxing Ren, NVIDIA Corp., US
Abstract
Large-language models (LLMs) have shown great potential in assisting chip design and analysis, with recent research focusing primarily on text-based tasks such as general QA, debugging, and design tool scripting. However, the chip design and implementation workflow often requires a visual understanding of diagrams, flowcharts, graphs, schematics, waveforms, and more, necessitating the development of multi-modality foundation models. To address this gap, we propose ChipVQA, a benchmark designed to evaluate the capability of visual language models (VLMs) for chip design. ChipVQA comprises 142 carefully crafted and collected VQA questions spanning five chip design disciplines: Digital Design, Analog Design, Architecture, Physical Design, and Semiconductor Manufacturing. Unlike existing VQA benchmarks, ChipVQA questions are meticulously created by chip design experts and require in-depth domain knowledge and reasoning to solve. Our comprehensive evaluations on both open-source and proprietary multi-modal models reveal significant challenges posed by the benchmark suite, with existing VLMs struggling to meet the demands of chip design knowledge and reasoning. Notably, GPT-4o achieves only a 44% correctness rate. Additionally, we conducted a preliminary study on an alternative VLM inference methodology using an agent, which showed improved performance in certain categories without additional training, highlighting the potential of leveraging LLM agents as an alternative approach for VLM deployment in chip design.
16:53 CEST FS02.2 FVEVAL: UNDERSTANDING LANGUAGE MODEL CAPABILITIES IN FORMAL VERIFICATION OF DIGITAL HARDWARE
Speaker:
Minwoo Kang, University of California, Berkeley, US
Authors:
Minwoo Kang1, Mingjie Liu2, Ghaith Bany Hamad2, Syed Suhaib2 and Haoxing Ren2
1University of California, Berkeley, US; 2NVIDIA Corp., US
Abstract
The remarkable reasoning and code generation capabilities of large language models (LLMs) have spurred significant interest in applying LLMs to enable task automation in digital chip design. In particular, recent work has investigated early ideas of applying these models to formal verification (FV), an approach to verifying hardware implementations that can provide strong guarantees of confidence but demands significant amounts of human effort. While the value of LLM-driven automation is evident, our understanding of model performance, however, has been hindered by the lack of holistic evaluation.  In response, we present FVEval, the first comprehensive benchmark and evaluation framework for characterizing LLM performance in tasks pertaining to FV.  The benchmark consists of three sub-tasks that measure LLM capabilities at different levels---from the generation of SystemVerilog assertions (SVAs) given natural language descriptions to reasoning about the design RTL and suggesting assertions directly without ad
17:15 CEST FS02.3 MAPS: MULTI-FIDELITY AI-AUGMENTED PHOTONIC SIMULATION AND INVERSE DESIGN INFRASTRUCTURE
Speaker:
Haoyu Yang, NVIDIA Corp., US
Authors:
Pingchuan Ma1, Zhengqi Gao2, Meng Zhang3, Haoyu Yang4, Haoxing Ren4, Rena Huang3, Duane Boning2 and Jiaqi Gu1
1Arizona State University, US; 2Massachusetts Institute of Technology, US; 3Rensselaer Polytechnic Institute, US; 4NVIDIA Corp., US
Abstract
"Inverse design has become a powerful approach in photonic device optimization, enabling access to high-dimensional, non-intuitive design spaces that lead to ultra-compact devices with superior performance, ultimately advancing the development of high-density photonic integrated circuits (PICs). The adjoint method plays a key role in this process by efficiently computing both the figure of merit (FoM) and its analytical gradient with only two simulations, enabling gradient-based device topology optimization. However, a significant computational bottleneck remains, i.e., the reliance on solving partial differential equations (PDEs) or eigenvalue problems within simulation-in-the-loop optimization frameworks, which hinders scalability. Recent advancements in AI-based solvers offer a promising solution by accelerating the solving of these PDEs and eigenvalue problems, enabling faster and more scalable inverse design processes. Despite these advancements, a major challenge persists—the absence of an open-source, standardized, widely available infrastructure and dataset for training and benchmarking AI-based PDE solvers tailored to photonic hardware. In this work, we introduce MAPS (Multi-Fidelity AI-Augmented Photonic Simulation and Inverse Design Benchmarking Infrastructure) to fill this gap. MAPS features: 1. MAP-Data: A photonic device dataset that covers a broad design space of representative device types, capturing both high- and low-performance designs. The dataset integrates multi-modal inputs (structure, light source, etc.) and physically significant evaluation metrics (FoMs and light fields, etc.), offering a rich data source for AI-based photonic simulation research. 2. MAPS-Train: A standardized AI-for-photonics neural operator model zoo and training framework, featuring extensible configurations and seamless integration with MAPS-Data pipelines, facilitating fair comparisons and standardized benchmarking of AI-based, physics-inspired photonic simulators. 3. MAPS-InvDes: An advanced adjoint method-based inverse design infrastructure that abstracts complex physical details, making it accessible to both computer-aided design (CAD) and machine learning (ML) communities. It integrates seamlessly with pre-trained AI-based PDE solvers and incorporates customized fabrication variation models (e.g., differentiable lithography and etching) to validate practical applicability in real-world inverse design tasks. This infrastructure MAPS bridges the gap between AI-for-physics and photonic device design by providing a standardized, open-source platform for developing and benchmarking AI-based solvers, ultimately accelerating innovation in both photonic hardware optimization and scientific ML."
17:38 CEST FS02.4 PICBENCH: BENCHMARKING LLMS FOR PHOTONIC INTEGRATED CIRCUITS DESIGN
Speaker:
Yuchao Wu, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Yuchao Wu1, Xiaofei Yu1, Hao Chen1, Yang Luo1, Yeyu Tong2 and Yuzhe Ma1
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
While large language models (LLMs) have shown remarkable potential in automating various tasks in digital chip design, the field of Photonic Integrated Circuits (PICs)—a promising solution to advanced chip designs—remains relatively unexplored in this context. The design of PICs is time-consuming and prone to errors due to the extensive and repetitive nature of code involved in photonic chip design. In this paper, we introduce PICBench, the first benchmarking and evaluation framework specifically designed to automate PIC design generation using LLMs, where the generated output takes the form of a netlist. Our benchmark consists of dozens of meticulously crafted PIC design problems, spanning from fundamental device designs to more complex circuit-level designs. It automatically evaluates both the syntax and functionality of generated PIC designs by comparing simulation outputs with expert-written solutions, leveraging an open-source simulator. We evaluate a range of existing LLMs, while also conducting comparative tests on various prompt engineering techniques to enhance LLM performance in automated PIC design. The results reveal the challenges and potential of LLMs in the PIC design domain, offering insights into the key areas that require further research and development to optimize automation in this field.

TS13 Embedded, Real-Time and Dependable Systems

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS13.1 HARDWARE-ASSISTED RANSOMWARE DETECTION USING AUTOMATED MACHINE LEARNING
Speaker:
Zhixin Pan, Florida State University, US
Authors:
Zhixin Pan1 and Ziyu Shu2
1Florida State University, US; 2Washington University in St. Louis, US
Abstract
Ransomware has emerged as a severe privacy threat, leading to significant financial and data losses worldwide. Traditional detection methods, including static signature-based detection and dynamic behavior-based analysis, have shown limitations in effectively identifying and mitigating ever-evolving ransomware attacks. In this paper, we present a machine learning-based framework with hardware-level microprocessor activity monitoring to enhance detection performance. Specifically, the proposed method incorporates adversarial training to address the weaknesses of conventional static analysis against obfuscation, along with hardware-assisted behavior monitoring to reduce latency, achieving effective and real-time ransomware detection. The proposed method employs a Neural Architecture Search (NAS) algorithm to automate the selection of optimal machine learning models, significantly boosting generalizability. Experimental results demonstrate that our proposed method improves detection accuracy and reduces detection latency compared to existing approaches, while also maintaining high generalizability across diverse ransomware types.
16:35 CEST TS13.2 RICH: HETEROGENEOUS COMPUTING FOR REAL-TIME INTELLIGENT CONTROL SYSTEMS
Speaker:
Jintao Chen, Shanghai Jiao Tong University, CN
Authors:
Jintao Chen, Yuankai Xu, Yinchen Ni, An Zou and Yehan Ma, Shanghai Jiao Tong University, CN
Abstract
Over the past years, intelligent control tasks, such as deep neural networks (DNNs), have demonstrated significant potential in control systems. However, deploying intelligent control policies on heterogeneous computing platforms presents open challenges. These challenges extend beyond the apparent conflict between intensive computation and timing constraints and further encompass the interactions between task executions and complicated control performance. To address these challenges, this paper introduces RICH, a general and end-to-end approach to facilitate intelligent control tasks on heterogeneous computing architectures. RICH incorporates both offline Control-Oriented Computation and Resource Mapping (CCRM) and runtime Most Remaining Accelerator Segment Number First Scheduling (MRAF). Given the control tasks, CCRM starts by balancing the computation workloads and processor resources with the goal of optimizing overall control performance. Subsequently, MRAF employs segment-level real-time scheduling to ensure the timely execution of tasks. Extensive experiments on robotic arms (via a hardware-in-the-loop simulator) demonstrate that RICH works as a general, end-to-end approach. These experiments reveal significant improvements in control performance, with enhancements of 50.7% observed for intelligent control applications deployed on heterogeneous computing platforms.
16:40 CEST TS13.3 RT-VIRTIO: TOWARDS THE REAL-TIME PERFORMANCE OF VIRTIO IN A TWO-TIER COMPUTING ARCHITECTURE
Speaker:
Siwei Ye, Shanghai Jiao Tong University, CN
Authors:
Siwei Ye1, Minqing Sun1, Huifeng Zhu2, Yier Jin3 and An Zou1
1Shanghai Jiao Tong University, CN; 2Washington University in St. Louis, US; 3University of Science and Technology of China, CN
Abstract
With the popularity of virtualization technology, ensuring reliable I/O operations with timing constraints in virtual environments becomes increasingly critical. Timing-predictable virtual I/O enhances the responsiveness and efficiency of virtualized systems, facilitating their seamless integration into time-critical applications such as industrial automation and robotics. Its significance lies in meeting rigorous performance standards, minimizing latency, and consistently delivering predictable I/O performance. As a result, virtual machines can effectively support mission-critical and time-sensitive workloads. However, due to the complicated system architecture, I/O operations in a virtualized environment face competition both from other I/O operations within the same virtual machine and from the I/O of other virtual machines targeting the same host machine. This study presents RT-VirtIO, a practical approach to providing predictable, real-time I/O operations. RT-VirtIO addresses the challenges associated with lengthy data paths and complex resource management. Through early-stage characterization, this study identifies key factors contributing to poor I/O real-time performance and then builds an analytical model and a learning-based data-driven model to predict the tail I/O latency. Leveraging these two models, RT-VirtIO effectively captures these dynamics, enabling the development of a general and applicable optimization framework. Experimental results demonstrate that RT-VirtIO significantly improves real-time performance in virtual environments (by 20.07%–30.90%) without necessitating hardware modifications, and exhibits promising applicability across a broader range of scenarios.
16:45 CEST TS13.4 ENABLING SECURITY ON THE EDGE: A CHERI COMPARTMENTALIZED NETWORK STACK
Speaker:
Donato Ferraro, University of Modena and Reggio Emilia, Minerva Systems, IT
Authors:
Donato Ferraro1, Andrea Bastoni2, Alexander Zuepke3 and Andrea Marongiu4
1Minerva Systems SRL, University of Modena and Reggio Emilia, IT; 2TUM, Minerva Systems, DE; 3TU Munich, DE; 4Università di Modena e Reggio Emilia, IT
Abstract
The widespread deployment of embedded systems in critical infrastructures, interconnected edge devices like autonomous drones, and smart industrial systems requires robust security measures. Compromised systems increase the risks of operational failures, data breaches, and---in safety-critical environments---potential physical harm to people. Despite these risks, current security measures are often insufficient to fully address the attack surfaces of embedded devices. CHERI provides strong security from the hardware level by enabling fine-grained compartmentalization and memory protection, which can reduce the attack surface and improve the reliability of such devices. In this work, we explore the potential of CHERI to compartmentalize one of the most critical and targeted components of interconnected systems: their network stack. Our case study examines the trade-offs of isolating applications, TCP/IP libraries, and network drivers on a CheriBSD system deployed on the Arm Morello platform. Our results suggest that CHERI has the potential to enhance security while maintaining performance in embedded-like environments.
16:50 CEST TS13.5 TOWARDS RELIABLE SYSTEMS: A SCALABLE APPROACH TO AXI4 TRANSACTION MONITORING
Speaker:
Chaoqun Liang, Università di Bologna, IT
Authors:
Chaoqun Liang1, Thomas Benz2, Alessandro Ottaviano2, Angelo Garofalo1, Luca Benini1 and Davide Rossi1
1Università di Bologna, IT; 2ETH Zurich, CH
Abstract
In safety-critical SoC applications such as automotive and aerospace, reliable transaction monitoring is crucial for maintaining system integrity. This paper introduces a drop-in Transaction Monitoring Unit (TMU) for AXI4 subordinate endpoints that detects transaction failures, including protocol violations and timeouts, and triggers recovery by resetting the affected subordinates. Two TMU variants address different constraints: a Tiny-Counter solution for tightly area-constrained systems and a Full-Counter solution for critical subordinates in mixed-criticality SoCs. The Tiny-Counter employs a single counter per outstanding transaction, while the Full-Counter uses multiple counters to track distinct transaction stages, offering finer-grained monitoring and reducing detection latencies by up to hundreds of cycles at roughly 2.5× the area cost. The Full-Counter also provides detailed error logs for performance and bottleneck analysis. Evaluations at both IP and system levels confirm the TMU's effectiveness and low overhead. In GF12 technology, monitoring 16–32 outstanding transactions occupies 1330–2616 µm² for the Tiny-Counter and 3452–6787 µm² for the Full-Counter; moderate prescaler steps reduce these figures by 18–39% and 19–32%, respectively, with no loss of functionality. Results from a full-system integration demonstrate the TMU's robust and precise monitoring capabilities in safety-critical SoC environments.
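In spirit, the Tiny-Counter variant amounts to one countdown per outstanding transaction. The behavioral sketch below shows that mechanism in Python pseudocode form; the real TMU is RTL and also checks protocol violations, which is omitted here, so the class and its interface are assumptions for illustration.

```python
# Behavioral sketch of per-transaction timeout monitoring (Tiny-Counter spirit).
# The real TMU is hardware (RTL) and also flags protocol violations; this toy
# model only tracks timeouts for outstanding transactions.
class TimeoutMonitor:
    def __init__(self, timeout_cycles: int):
        self.timeout = timeout_cycles
        self.pending: dict[int, int] = {}   # transaction id -> remaining cycles

    def issue(self, txn_id: int):
        self.pending[txn_id] = self.timeout

    def complete(self, txn_id: int):
        self.pending.pop(txn_id, None)

    def tick(self) -> list[int]:
        """Advance one cycle; return ids whose subordinate should be reset."""
        expired = []
        for txn_id in list(self.pending):
            self.pending[txn_id] -= 1
            if self.pending[txn_id] <= 0:
                expired.append(txn_id)
                del self.pending[txn_id]
        return expired

mon = TimeoutMonitor(timeout_cycles=3)
mon.issue(0); mon.issue(1); mon.complete(1)
print([mon.tick() for _ in range(4)])   # transaction 0 expires on the third tick
```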
16:55 CEST TS13.6 EXACT SCHEDULABILITY ANALYSIS FOR LIMITED-PREEMPTIVE PARALLEL APPLICATIONS USING TIMED AUTOMATA IN UPPAAL
Speaker:
Jonas Hansen, Aalborg Universitet, DK
Authors:
Jonas Hansen1, Srinidhi Srinivasan2, Geoffrey Nelissen3 and Kim Larsen1
1Aalborg Universitet, DK; 2Technische Universiteit Eindhoven (TU/e), NL; 3Eindhoven University of Technology, NL
Abstract
We study the problem of verifying schedulability and ascertaining response time bounds of limited-preemptive parallel applications with uncertainty, scheduled on multi-core platforms. While sufficient techniques exist for analysing schedulability and response time of parallel applications under fixed-priority scheduling, their accuracy remains uncertain due to the lack of a scalable and exact analysis that can serve as a ground-truth to measure the pessimism of existing sufficient analyses. In this paper, we address this gap using formal methods. We use Timed Automata and the powerful UPPAAL verification engine to develop a generic approach to model parallel applications and provide a scalable and exact schedulability and response time analysis. This work establishes a benchmark for evaluating the accuracy of both existing and future sufficient analysis techniques. Furthermore, our solution is easily extendable to more complex task models thanks to its flexible model architecture.
17:00 CEST TS13.7 MONOMORPHISM-BASED CGRA MAPPING VIA SPACE AND TIME DECOUPLING
Speaker:
Cristian Tirelli, Università della Svizzera italiana, CH
Authors:
Cristian Tirelli, Rodrigo Otoni and Laura Pozzi, Università della Svizzera italiana, CH
Abstract
Coarse-Grain Reconfigurable Arrays (CGRAs) provide flexibility and energy efficiency in accelerating compute-intensive loops. Existing compilation techniques often struggle with scalability, unable to map code onto large CGRAs. To address this, we propose a novel approach to the mapping problem where the time and space dimensions are decoupled and explored separately. We leverage an SMT formulation to traverse the time dimension first, and then perform a monomorphism-based search to find a valid spatial solution. Experimental results show that our approach achieves the same mapping quality of state-of-the-art techniques while significantly reducing compilation time, with this reduction being particularly tangible when compiling for large CGRAs. We achieve approximately 10^5x average compilation speedup for the benchmarks evaluated on a 20x20 CGRA.
17:05 CEST TS13.8 ATTENTIONLIB: A SCALABLE OPTIMIZATION FRAMEWORK FOR AUTOMATED ATTENTION ACCELERATION ON FPGA
Speaker:
Zhenyu Liu, Fudan University, CN
Authors:
Zhenyu Liu, Xilang Zhou, Faxian Sun, Jianli Chen, Jun Yu and Kun Wang, Fudan University, CN
Abstract
The self-attention mechanism is a fundamental component within transformer-based models. Nowadays, as the length of sequences processed by large language models (LLMs) continues to increase, the attention mechanism has gradually become a bottleneck in model inference. The LLM inference process can be separated into two phases: prefill and decode. The latter contains memory-intensive attention computation, making FPGA-based accelerators an attractive solution for acceleration. However, designing accelerators tailored for the attention module poses a challenge, requiring substantial manual work. To automate this process and achieve superior acceleration performance, we propose AttentionLib, an MLIR-based framework. AttentionLib automatically performs fusion dataflow optimization for attention computations and generates high-level synthesis code in compliance with hardware constraints. Given the large design space, we provide a design space exploration (DSE) engine to automatically identify optimal fusion dataflows within the specified constraints. Experimental results show that AttentionLib is effective in generating well-suited accelerators for diverse attention computations and achieving superior performance under hardware constraints. Notably, the accelerators generated by AttentionLib exhibit at least a 25.1× improvement compared to the baselines solely automatically optimized by Vitis HLS. Furthermore, these designs outperform GPUs in decode workloads, showcasing over a 2× speedup for short sequences.
17:10 CEST TS13.9 ENSURING DATA FRESHNESS FOR IN-STORAGE COMPUTING WITH COOPERATIVE BUFFER MANAGER
Speaker:
Yang Guo, The Chinese University of Hong Kong, HK
Authors:
Jin Xue, Yuhong Song, Yang Guo and Zili Shao, The Chinese University of Hong Kong, HK
Abstract
In-storage computing (ISC) aims to mitigate the excessive data movement between the host memory and storage by offloading computation to storage devices for in-situ execution. However, ensuring data freshness remains a key challenge for practical ISC. For performance considerations, many data processing systems implement a buffer manager to cache part of the on-disk data in the host memory. While the host applications commit updates to the in-memory cached copies of the data, ISC operators offloaded to the device only have access to the on-disk persistent data. Thus, ISC may miss the most recent updates from the host and produce incorrect results after reading the stale and inconsistent data from the persistent storage. With this limitation, current ISC can only be used in read-only settings where the on-disk data are not subject to concurrent updates. To tackle this problem, we propose a cooperative buffer manager for ISC to transparently provide data freshness guarantees to host applications. Proposed methods allow the device to synchronize with the host buffer manager and decide whether to read the most recent copy of data from host memory or flash memory. We implement our method based on a real hardware platform and perform evaluation with a B+-tree based key-value store. Experiments show that our method can provide transparent data freshness for host applications with reduced latency.
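A toy sketch of the freshness decision described above: before an in-storage operator reads a page from flash, it consults a shared view of pages cached dirty in the host buffer pool and fetches those from host memory instead. The structures and names are hypothetical simplifications of the paper's mechanism.

```python
# Cooperative read-path decision for in-storage computing (illustrative sketch).
# Assumption: the device can query which pages are cached dirty on the host and
# pull their latest contents; names and data structures are hypothetical.
class CooperativeBuffer:
    def __init__(self, flash: dict[int, bytes], host_dirty: dict[int, bytes]):
        self.flash = flash            # page_id -> persistent (possibly stale) data
        self.host_dirty = host_dirty  # page_id -> newest copy, dirty in host cache

    def read_fresh(self, page_id: int) -> bytes:
        """Return the freshest copy: a host-cached dirty page wins over flash."""
        if page_id in self.host_dirty:
            return self.host_dirty[page_id]   # synchronize with host buffer pool
        return self.flash[page_id]            # on-disk copy is already fresh

buf = CooperativeBuffer(flash={1: b"old", 2: b"ok"}, host_dirty={1: b"new"})
assert buf.read_fresh(1) == b"new" and buf.read_fresh(2) == b"ok"
```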
17:15 CEST TS13.10 EVALUATING COMPILER-BASED RELIABILITY WITH RADIATION FAULT INJECTION
Speaker:
Davide Baroffio, Politecnico di Milano, IT
Authors:
Davide Baroffio, Tomas López, Federico Reghenzani and William Fornaciari, Politecnico di Milano, IT
Abstract
Compiler-based fault tolerance is a cost-effective and flexible family of solutions that transparently improves software reliability. This paper evaluates a compiler tool for fault detection via laser injection and α-particle exposure. A novel memory allocation strategy is proposed to mitigate the effects of multi-bit upsets. We integrated the detection mechanism with a recovery solution based on mixed-criticality scheduling. The results demonstrate the error detection and recovery capabilities in realistic scenarios: reducing undetected errors, enhancing system reliability, and advancing software-implemented fault tolerance.
17:16 CEST TS13.11 UMBRA: AN EFFICIENT FRAMEWORK FOR TRUSTED EXECUTION ON MODERN TRUSTZONE-ENABLED MICROCONTROLLERS
Speaker:
Stefano Mercogliano, Università di Napoli Federico II, IT
Authors:
Stefano Mercogliano1 and Alessandro Cilardo2
1Università di Napoli Federico II, IT; 2University of Naples, Federico II, IT
Abstract
The rise of microcontrollers in critical systems demands robust security measures beyond traditional methods like Memory Protection Units. ARM's TrustZone-M offers enhanced protection for secure applications, yet its potential for deploying Trusted Execution Environments often remains untapped, leaving room for innovation in managing security on resource-constrained devices. This paper presents Umbra, a Rust-based framework that isolates mutually distrustful applications and integrates with untrusted embedded OSes. Leveraging modern security hardware, Umbra features an efficient secure caching mechanism that encrypts all code exposed to attackers, decrypting and validating only necessary blocks during execution, achieving practical Trusted Execution Environments on modern microcontrollers.
17:17 CEST TS13.12 HARDWARE/SOFTWARE CO-ANALYSIS FOR WORST CASE EXECUTION TIME BOUNDS
Speaker:
Can Joshua Lehmann, Karlsruhe Institute of Technology, DE
Authors:
Can Lehmann1, Lars Bauer2, Hassan Nassar1, Heba Khdr1 and Joerg Henkel1
1Karlsruhe Institute of Technology, DE; 2Independent Scholar, DE
Abstract
Ensuring that safety-critical systems meet timing constraints is crucial to avoid disastrous failures. To verify that timing requirements are met, a worst-case execution time (WCET) bound is computed. However, traditional WCET tools require a predefined timing model for each target processor, which is not available when using custom instruction set extensions. We introduce a novel approach based on hardware-software co-analysis that employs an instrumented hardware description of the target processor, removing the requirement for a separate timing model. We demonstrate this approach by extending the FemtoRV32 Individua RISC-V processor with a custom instruction set extension and show that it accurately models the timing behavior of the resulting system.

TS14 Architectural and microarchitectural design - 2

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS14.1 SPARSYNERGY: UNLOCKING FLEXIBLE AND EFFICIENT DNN ACCELERATION THROUGH MULTI-LEVEL SPARSITY
Speaker:
Jingkui Yang, National University of Defense Technology, CN
Authors:
Jingkui Yang1, Mei Wen1, Junzhong Shen2, Jianchao Yang1, Yasong Cao1, Jun He1, Minjin Tang3, Zhaoyun Chen1 and Yang Shi4
1National University of Defense Technology, CN; 2Key Laboratory of Advanced Microprocessor Chips and Systems, National University of Defense Technology, CN; 3National University of Defense Technology, Key Laboratory of Advanced Microprocessor Chips and Systems, CN; 4National Key Laboratory for Parallel and Distributed Processing and Department of Computer, National University of Defense Technology, CN
Abstract
To more effectively address the computational and memory requirements of deep neural networks (DNNs), leveraging multi-level sparsity (including value-level and bit-level sparsity) has emerged as a pivotal strategy. While substantial research has been dedicated to exploring value-level and bit-level sparsity individually, the combination of both has largely been overlooked until now. In this paper, we propose SparSynergy, which, to the best of our knowledge, is the first accelerator that synergistically integrates multi-level sparsity into a unified framework, maximizing computational efficiency and minimizing memory usage. However, jointly considering multi-level sparsity is non-trivial, as it presents several challenges: (1) increased hardware overhead due to the complexity of incorporating multiple sparsity levels, (2) bandwidth-intensive data transmission during multiplexing, and (3) decreased throughput and scalability caused by bottlenecks in bit-serial computation. Our proposed SparSynergy addresses these challenges by introducing a unified sparsity format and a co-optimized hardware design. Experimental results demonstrate that SparSynergy achieves a 5.38x geometric mean improvement in the energy-delay product (EDP) when compared with the tensor core, across workloads with varying degrees of sparsity. Furthermore, SparSynergy significantly improves accuracy retention compared to state-of-the-art accelerators for representative DNNs.
16:35 CEST TS14.2 PS-GS: GROUP-WISE PARALLEL RENDERING WITH STAGE-WISE COMPLEXITY REDUCTIONS FOR REAL-TIME 3D GAUSSIAN SPLATTING
Speaker:
Joongho Jo, Korea University, KR
Authors:
Joongho Jo and Jongsun Park, Korea University, KR
Abstract
3D Gaussian Splatting (3D-GS) is an emerging rendering technique that surpasses the neural radiance field (NeRF) in both rendering speed and image quality. Despite its advantages, running 3D-GS on mobile or edge devices in real time remains challenging due to its large computational complexity. In this paper, we introduce PS-GS, specialized low-complexity hardware designed to enhance the parallelism of the 3D-GS rendering pipeline. In this work, we first observe that 3D-GS rendering can be parallelized when the approximate order of Gaussians, from those closest to the camera to those farthest, is known in advance. However, to enhance 3D-GS rendering speed via parallel processing, an efficient viewpoint-adaptive grouping method with low computational costs is essential. Two key computational bottlenecks of viewpoint-adaptive grouping are the grouping of invisible Gaussians and depth-based sorting. For efficient group-wise parallel rendering with low-complexity viewpoint-adaptive grouping, we propose three key techniques—cluster-based preprocessing, sorting, and grouping—all seamlessly incorporated into the PS-GS architecture. Our experimental results demonstrate that PS-GS delivers an average speedup of 1.20× with negligible peak signal-to-noise ratio (PSNR) degradation.
16:40 CEST TS14.3 TXISC: TRANSACTIONAL FILE PROCESSING IN COMPUTATIONAL SSDS
Speaker:
Penghao Sun, Shanghai Jiao Tong University, CN
Authors:
Penghao Sun1, Shengan Zheng1, Kaijiang Deng1, Guifeng Wang1, Jin Pu1, Jie Yang2, Maojun Yuan2, Feng Zhu2, Shu Li2 and Linpeng Huang1
1Shanghai Jiao Tong University, CN; 2Alibaba Group, CN
Abstract
Computational SSDs implement the in-storage computing (ISC) paradigm and benefit applications by taking over I/O-intensive tasks from the host. Existing works have proposed various frameworks aiming at easy access to ISC functionalities, and among them generic frameworks with file-based abstractions offer better usability. However, since intermediate output by ISC tasks may leave files in a dirty state, concurrent access to and the integrity of file data should be properly managed, which has not been fully addressed. In this paper, we present TxISC, a generic ISC framework that coordinates the host kernel and device firmware to offer a versatile file-based programming model. Under the hood, TxISC turns each invocation of an ISC task into a transaction with full ACID guarantee, fully covering concurrency control and data protection. TxISC implements transactions at low cost by leveraging the out-of-place write characteristic of NAND flash. Evaluation on full-stack hardware shows that transactions incur almost no runtime performance penalty compared with existing ISC architectures. Application case studies demonstrate that the programming model of TxISC can be used to offload complex logic and deliver significant speedup over host-only solutions.
16:45 CEST TS14.4 ARAXL: A PHYSICALLY SCALABLE, ULTRA-WIDE RISC-V VECTOR PROCESSOR DESIGN FOR FAST AND EFFICIENT COMPUTATION ON LONG VECTORS
Speaker:
Navaneeth Kunhi Purayil, ETH Zurich, CH
Authors:
Navaneeth Kunhi Purayil1, Matteo Perotti1, Tim Fischer1 and Luca Benini2
1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT
Abstract
The ever-growing scale of data parallelism in today's HPC and ML applications presents a big challenge for computing architectures' energy efficiency and performance. Vector processors address the scale-up challenge by decoupling Vector Register File (VRF) and datapath widths, allowing the VRF to host long vectors and increase register-stored data reuse while reducing the relative cost of instruction fetch and decode. However, even the largest vector processor designs today struggle to scale to more than 8 vector lanes with double-precision Floating Point Units (FPUs) and 256 64-bit elements per vector register. This limitation is induced by difficulties in the physical implementation, which becomes wire-dominated and inefficient. In this work, we present AraXL, a modular and scalable 64-bit RISC-V V vector architecture targeting long-vector applications for HPC and ML. AraXL addresses the physical scalability challenges of state-of-the-art vector processors with a distributed and hierarchical interconnect, supporting up to 64 parallel vector lanes and reaching the maximum Vector Register File size of 64 Kibit/vreg permitted by the RISC-V V 1.0 ISA specification. Implemented in a 22-nm technology node, our 64-lane AraXL achieves a performance peak of 146 GFLOPs on computation-intensive HPC/ML kernels (>99% FPU utilization) and energy efficiency of 40.1 GFLOPs/W (1.15 GHz, TT, 0.8V), with only 3.8x the area of a 16-lane instance.
16:50 CEST TS14.5 PERFORMANCE IMPLICATIONS OF MULTI-CHIPLET NEURAL PROCESSING UNITS ON AUTONOMOUS DRIVING PERCEPTION
Speaker:
Luke Chen, University of California, Irvine, US
Authors:
Mohanad Odema, Luke Chen, Hyoukjun Kwon and Mohammad Al Faruque, University of California, Irvine, US
Abstract
We study the application of emerging chiplet-based Neural Processing Units to accelerate vehicular AI perception workloads in constrained automotive settings. The motivation stems from how chiplets technology is becoming integral to emerging vehicular architectures, providing a cost-effective trade-off between performance, modularity, and customization; and from perception models being the most computationally demanding workloads in a autonomous driving system. Using the Tesla Autopilot perception pipeline as a case study, we first breakdown its constituent models and profile their performance on different chiplet accelerators. From the insights, we propose a novel scheduling strategy to efficiently deploy perception workloads on multi-chip AI accelerators. Our experiments using a standard DNN performance simulator, MAESTRO, show our approach realizes 82% and 2.8× increase in throughput and processing engines utilization compared to monolithic accelerator designs.
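The scheduling problem sketched in the abstract can be pictured with a generic list scheduler: each sub-model (or layer group) is assigned to the chiplet that becomes free earliest. The function below is a plain illustration under that assumption and is not the paper's scheduling strategy; the stage latencies are hypothetical and inter-chiplet transfer costs are ignored.

    def greedy_chiplet_schedule(stage_latency, num_chiplets):
        """Assign each stage to the chiplet that frees up first (list scheduling).
        Illustrative only; ignores transfer costs and stage dependencies."""
        ready = [0.0] * num_chiplets                    # time at which each chiplet is free
        plan = []
        for lat in stage_latency:
            c = min(range(num_chiplets), key=ready.__getitem__)
            plan.append((c, ready[c], ready[c] + lat))  # (chiplet, start, end)
            ready[c] += lat
        return plan

    # Hypothetical per-stage latencies (ms) mapped onto 4 chiplets.
    print(greedy_chiplet_schedule([3.0, 1.5, 2.0, 4.0, 0.5], num_chiplets=4))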
16:55 CEST TS14.6 LT-OAQ: LEARNABLE THRESHOLD BASED OUTLIER-AWARE QUANTIZATION AND ITS ENERGY-EFFICIENT ACCELERATOR FOR LOW-PRECISION ON-CHIP TRAINING
Speaker:
Qinkai Xu, Nanjing University, CN
Authors:
Qinkai Xu, Yijin Liu, Yuan Meng, Yang Chen, Yunlong Mao, Li Li and Yuxiang Fu, Nanjing University, CN
Abstract
Low-precision training has emerged as a powerful technique for reducing computational and storage costs in Deep Neural Network (DNN) model training, enabling on-chip training or fine-tuning on edge devices. However, existing low-precision training methods often require higher bit-widths to maintain accuracy as model sizes increase. In this paper, we introduce an outlier-aware quantization strategy for low-precision training. While traditional value-aware quantization methods require costly online distribution statistics operations on computational data, impeding the efficiency gains of low-precision training, our approach addresses this challenge through a novel Learnable Threshold based Outlier-Aware Quantization (LT-OAQ) training framework. This method concurrently updates outlier thresholds and model weights through gradient descent, eliminating the need for costly data-statistics operations. To efficiently support the LT-OAQ training framework, we designed a hardware accelerator based on the systolic array architecture. This accelerator introduces a processing element (PE) fusion mechanism that dynamically fuses adjacent PEs into clusters to support outlier computations, optimizing the mapping of outlier computation tasks, enabling mixed-precision training, and implementing online quantization. Our approach maintains model accuracy while significantly reducing computational complexity and storage resource requirements. Experimental results demonstrate that our design achieves a 2.9x speedup in performance and a 2.17x reduction in energy consumption compared to state-of-the-art low-precision accelerators.
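The core outlier-aware idea can be sketched in a few lines of NumPy: values whose magnitude stays below a threshold are quantized to a low bit-width, while the rare outliers are passed through at higher precision. In LT-OAQ the threshold is a learnable parameter updated by gradient descent together with the weights; here it is a fixed argument, so this is only a behavioral sketch with assumed parameters.

    import numpy as np

    def outlier_aware_quantize(x, threshold, inlier_bits=4):
        """Quantize inliers (|x| <= threshold) to inlier_bits; keep outliers in
        higher precision. Threshold is fixed here, learnable in LT-OAQ."""
        qmax = 2 ** (inlier_bits - 1) - 1
        scale = threshold / qmax
        q_inliers = np.clip(np.round(x / scale), -qmax - 1, qmax) * scale
        inlier_mask = np.abs(x) <= threshold
        return np.where(inlier_mask, q_inliers, x), inlier_mask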
17:00 CEST TS14.7 LIGNN: ACCELERATING GNN TRAINING THROUGH LOCALITY-AWARE DROPOUT
Speaker:
Gongjian Sun, SKLP, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences, CN
Authors:
Gongjian Sun1, Mingyu Yan2, Dengke Han3, Runzhen Xue4, Xiaochun Ye1 and Dongrui Fan1
1SKLP, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN; 3SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN; 4State Key Lab of Processors, Institute of Computing Technology, CAS; School of Computer Science and Technology, University of Chinese Academy of Sciences, CN
Abstract
Graph Neural Networks (GNNs) have demonstrated significant success in graph learning and are widely adopted across various critical domains. However, the irregular connectivity between vertices leads to inefficient neighbor aggregation, resulting in substantial irregular and coarse-grained DRAM accesses. This lack of data locality presents significant challenges for execution platforms, ultimately degrading performance. While previous accelerator designs have leveraged on-chip memory and data access scheduling strategies to address this issue, they still inevitably access features at irregular addresses from DRAM. In this work, we propose LiGNN, a hardware-based solution that enhances locality and applies dropout to aggregation to accelerate GNN training. Unlike algorithmic dropout approaches that primarily focus on improving accuracy and neglect hardware costs, LiGNN is specifically designed to drop graph features with data locality awareness, directly targeting the reduction of irregular DRAM accesses while maintaining accuracy. LiGNN introduces locality-aware ordering and a DRAM row integrity policy, enabling configurable burst- and row-granularity dropout at the DRAM level. This approach improves data locality and ensures more efficient DRAM access. Compared to state-of-the-art methods, at a typical dropout rate of 0.5, LiGNN achieves a 1.6~2.2x speedup, reduces DRAM accesses by 44~50% and DRAM row activations by 41~82%, all without losing accuracy.
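One way to picture row-granularity dropout is to map each neighbor feature fetch to a DRAM row and then drop whole rows rather than individual neighbors, so that the surviving accesses stay row-local. The address-to-row mapping and row size below are assumptions for illustration, not LiGNN's actual policy.

    import numpy as np

    def row_granularity_dropout(neighbor_ids, feature_bytes, row_bytes=2048, drop_rate=0.5, seed=0):
        """Drop neighbors row by row: survivors cluster in the DRAM rows that
        were kept, improving access locality (illustrative sketch only)."""
        rng = np.random.default_rng(seed)
        rows = (neighbor_ids * feature_bytes) // row_bytes       # crude address -> row map
        unique_rows = np.unique(rows)
        kept_rows = unique_rows[rng.random(unique_rows.size) >= drop_rate]
        return neighbor_ids[np.isin(rows, kept_rows)]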
17:05 CEST TS14.8 COUPLEDCB: ELIMINATING WASTED PAGES IN COPYBACK-BASED GARBAGE COLLECTION FOR SSDS
Speaker:
Jun Li, Nanjing University of Posts and Telecommunications, CN
Authors:
Jun Li1, Xiaofei Xu2, Zhibing Sha3, Xiaobai Chen1, Jieming Yin1 and Jianwei Liao4
1Nanjing University of Posts and Telecommunications, CN; 2RMIT University, AU; 3Southwest University, CN; 4Southwest University of China, CN
Abstract
The management of garbage collection poses significant challenges in high-density NAND flash-based SSDs. The introduction of the copyback command aims to expedite the migration of valid data. However, its odd/even constraint causes wasted pages during migrations, limiting the efficiency of garbage collection. Additionally, while full-sequence programming enhances write performance in high-density SSDs, it increases write granularity and exacerbates the issue of wasted pages. To address the problem of wasted pages, we propose a novel method called CoupledCB, which utilizes coupled blocks to fill up the wasted space in copyback-based garbage collection. By taking into account the access characteristics of the candidate coupled blocks and workloads, we develop a coupled block selection model assisted by logistic regression. Experimental results show that our proposal significantly enhances garbage collection efficiency and I/O performance compared to state-of-the-art schemes.
17:10 CEST TS14.9 LIGHTMAMBA: EFFICIENT MAMBA ACCELERATION ON FPGA WITH QUANTIZATION AND HARDWARE CO-DESIGN
Speaker:
Renjie Wei, Peking University, CN
Authors:
Renjie Wei, Songqiang Xu, Linfeng Zhong, Zebin Yang, Qingyu Guo, Yuan Wang, Runsheng Wang and Meng Li, Peking University, CN
Abstract
State space models (SSMs) like Mamba have recently attracted much attention. Compared to Transformer-based large language models (LLMs), Mamba achieves linear computation complexity with the sequence length and demonstrates superior performance. However, Mamba is hard to accelerate due to the scattered activation outliers and the complex computation dependency, rendering existing LLM accelerators inefficient. In this paper, we propose LightMamba that co-designs the quantization algorithm and FPGA accelerator architecture for efficient Mamba inference. We first propose an FPGA-friendly post-training quantization algorithm that features rotation-assisted quantization and power-of-two SSM quantization to reduce the majority of computation to 4-bit. We further design an FPGA accelerator that partially unrolls the Mamba computation to balance the efficiency and hardware costs. Through computation reordering as well as fine-grained tiling and fusion, the hardware utilization and memory efficiency of the accelerator get drastically improved. We implement LightMamba on Xilinx Versal VCK190 FPGA and achieve 4.65∼6.06× higher energy efficiency over the GPU baseline. When evaluated on Alveo U280 FPGA, LightMamba reaches 93 tokens/s, which is 1.43× that of the GPU baseline.
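As a rough illustration of the power-of-two quantization mentioned above (not LightMamba's exact SSM quantizer), the scale factor can be restricted to a power of two so that dequantization reduces to a bit shift in hardware:

    import numpy as np

    def po2_quantize(x, bits=4):
        """Symmetric quantization with a power-of-two scale (dequant = shift).
        Generic sketch, not the paper's rotation-assisted scheme."""
        qmax = 2 ** (bits - 1) - 1
        scale = 2.0 ** np.ceil(np.log2(np.max(np.abs(x)) / qmax + 1e-12))
        q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale        # reconstruct with q * scale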
17:15 CEST TS14.10 EVALUATING IOMMU-BASED SHARED VIRTUAL ADDRESSING FOR RISC-V EMBEDDED HETEROGENEOUS SOCS
Speaker:
Cyril Koenig, ETH Zurich, CH
Authors:
Cyril Koenig, Enrico Zelioli and Luca Benini, ETH Zurich, CH
Abstract
Embedded heterogeneous Systems-on-Chips (SoCs) rely on domain-specific hardware accelerators to improve performance and energy efficiency. In particular, programmable multicore accelerators feature a cluster of processing elements and tightly coupled scratchpad memories to balance performance, energy efficiency, and flexibility. In embedded systems running a general-purpose OS, accelerators access data via dedicated, physically addressed memory regions. This negatively impacts memory utilization and performance by requiring a copy from the virtual host address to the physical accelerator address space. Input-Output Memory Management Units (IOMMUs) overcome this limitation by allowing devices and hosts to use a shared virtual, paged address space. However, resolving IO virtual addresses can be particularly costly on high-latency memory systems as it requires up to three sequential memory accesses on IOTLB miss. In this work, we present a quantitative evaluation of shared virtual addressing in RISC-V heterogeneous embedded systems. We integrate an IOMMU in an open source heterogeneous RISC-V SoC consisting of a 64-bit host with a 32-bit accelerator cluster. We evaluate the system performance by emulating the design on FPGA and implementing compute kernels from the RajaPERF benchmark suite using heterogeneous OpenMP programming. We measure transfers and computation time on the host and accelerators for systems with different DRAM access latencies. We first show that IO virtual address translation can account for 4.2% up to 17.6% of the accelerator's runtime for GEMM (General Matrix Multiplication) at low and high memory bandwidth. Then, we show that in systems containing a last-level cache, this IO address translation cost falls to 0.4% and 0.7% under the same conditions, making shared-virtual addressing and zero-copy offloading suitable for such RISC-V heterogeneous SoCs.

TS15 Power and Energy Efficient Systems

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS15.1 LESS IS MORE: OPTIMIZING FUNCTION CALLING FOR LLM EXECUTION ON EDGE DEVICES
Speaker:
Iraklis Anagnostopoulos, Southern Illinois University Carbondale, US
Authors:
Varatheepan Paramanayakam1, Andreas Karatzas2, Iraklis Anagnostopoulos2 and Dimitrios Stamoulis3
1Southern Illinois University, US; 2Southern Illinois University Carbondale, US; 3The University of Texas at Austin, US
Abstract
The advanced function-calling capabilities of foundation models open up new possibilities for deploying agents to perform complex API tasks. However, managing large amounts of data and interacting with numerous APIs makes function calling hardware-intensive and costly, especially on edge devices. Current Large Language Models (LLMs) struggle with function calling at the edge because they cannot handle complex inputs or manage multiple tools effectively. This results in low task-completion accuracy, increased delays, and higher power consumption. In this work, we introduce Less-is-More, a novel fine-tuning-free function-calling scheme for dynamic tool selection. Our approach is based on the key insight that selectively reducing the number of tools available to LLMs significantly improves their function-calling performance, execution time, and power efficiency on edge devices. Experimental results with state-of-the-art LLMs on edge hardware show agentic success rate improvements, with execution time reduced by up to 70% and power consumption by up to 40%.
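A minimal sketch of dynamic tool selection under the assumption that tools and the query are represented by embedding vectors: only the k tool descriptions most similar to the query are placed in the function-calling prompt. The names and the similarity metric are placeholders; the paper's selection scheme may differ.

    import numpy as np

    def select_tools(query_vec, tool_vecs, tool_names, k=3):
        """Keep the k tools whose description embeddings are closest to the query
        (cosine similarity); the reduced tool list is then handed to the LLM."""
        sims = tool_vecs @ query_vec / (
            np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)
        return [tool_names[i] for i in np.argsort(sims)[::-1][:k]]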
16:35 CEST TS15.2 SSMDVFS: MICROSECOND-SCALE DVFS BASED ON SUPERVISED AND SELF-CALIBRATED ML ON GPGPUS
Speaker:
Minqing Sun, Shanghai Jiao Tong University, CN
Authors:
Minqing Sun1, Ruiqi Sun1, Yingtao Shen1, Wei Yan2, Qinfen Hao2 and An Zou1
1Shanghai Jiao Tong University, CN; 2The Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
Over the past decade, as GPUs have evolved to achieve higher computational performance, their power density has also increased rapidly. Consequently, improving energy efficiency and reducing power consumption have become critically important. Dynamic voltage and frequency scaling (DVFS) is an effective technique for enhancing energy efficiency. With the advent of integrated voltage regulators, DVFS can now operate on microsecond (µs) timescales. However, developing a practical and effective strategy to guide rapid DVFS remains a significant challenge. This paper proposes a supervised and self-calibrated machine learning framework (SSMDVFS) to guide microsecond-scale GPU voltage and frequency scaling. This framework features an end-to-end design that encompasses data generation, neural network model design, training, compression, and final runtime calibration. Unlike analytical models, which struggle to accurately represent GPU architectures, and reinforcement learning approaches, which can be challenging to converge during runtime, SSMDVFS offers a practical solution for guiding microsecond-scale voltage and frequency scaling. Experimental results demonstrate that the proposed framework improves energy-delay product (EDP) by 11.09% and outperforms analytical models and reinforcement learning approaches by 13.17% and 36.80%, respectively.
16:40 CEST TS15.3 A 3D DESIGN METHODOLOGY FOR INTEGRATED WEARABLE SOCS: ENABLING ENERGY EFFICIENCY AND ENHANCED PERFORMANCE AT ISO-AREA FOOTPRINT
Speaker:
Ekin Sumbul, Meta, US
Authors:
H. Ekin Sumbul1, Arne Symons2, Lita Yang2, Huichu Liu2, Tony Wu2, Matheus Trevisan Moreira2, Debabrata Mohapatra2, Abhinav Agarwal2, Kaushik Ravindran2, Chris Thompson2, Yuecheng Li2 and Edith Beigne2
1Meta, US; 2META, US
Abstract
Augmented Reality (AR) System-on-Chips (SoCs) have strict power budgets and form-factor limitations for wearable, all-day use AR glasses running high-performance applications. Limited compute and memory resources that can fit within the strict industrial design area footprint of an AR SoC, however, create performance bottlenecks for demanding workloads such as Pixel Codec Avatars (PiCA) group-calling which connects multiple users with their photorealistic representations. To alleviate this unique wearables challenge, 3D integration with hybrid-bonding technology offers energy-efficient 3D stacking of more silicon resources within the same SoC footprint. Implementing such 3D architectures, however, is another challenge as current EDA tools and flows offer limited 3D design control. In this work, we present a 3D design methodology for robust 3D clock network and datapath design using current EDA tools. To validate the proposed methodology, we implemented a 3D integrated prototype AR SoC housing a 3D-stacked Machine Learning (ML) accelerator utilizing TSMC SoIC™ bonding technology. Silicon measurements demonstrate that the 3D ML accelerator enables running PiCA AR group call at 30 frames-per-second (fps) by 3D-expanding its memory resources by 4× to achieve 2× better energy-efficiency when compared to a 2D baseline accelerator at iso-footprint.
16:45 CEST TS15.4 A LOW-POWER MIXED-PRECISION INTEGRATED MULTIPLY-ACCUMULATE ARCHITECTURE FOR QUANTIZED DEEP NEURAL NETWORKS
Speaker:
Xiaolu Hu, Department of Micro-Nano Electronics, Shanghai Jiao Tong University, CN
Authors:
Xiaolu Hu1, Xinkuang Geng1, Zhigang Mao2, Jie Han3 and Honglan Jiang1
1Shanghai Jiao Tong University, CN; 2Department of Mico-Nano Electronics, CN; 3University of Alberta, CA
Abstract
As mixed-precision quantization techniques have been widely considered for balancing computational efficiency and flexibility in quantized deep neural networks (DNNs), mixed-precision multiply-accumulate (MAC) units are increasingly important in DNN accelerators. However, conventional mixed-precision MAC architectures support either signed×signed or unsigned×unsigned multiplications. Signed×unsigned multiplication, which enhances the computing efficiency of DNNs with ReLU activations, has never been considered in the design of mixed-precision MACs. Thus, this work proposes a mixed-precision MAC architecture supporting six operation modes: int8×int8, int8×uint8, two int4×int4, two int4×uint4, four int2×int2, and four int2×uint2. In this design, to balance the power and delay of the different modes, the multiplication is implemented based on four precision-split 4×4 multipliers (PS4Ms). The accumulation is integrated into the partial product accumulation of the multiplication to eliminate redundant switching activities in separate compression. With a 10% area reduction, the proposed MAC, denoted PS4MAC, reduces power by over 35%, 42%, and 56% for 8-bit, 4-bit, and 2-bit operations, respectively, compared with a design based on the Synopsys DesignWare (DW) multipliers. Additionally, it achieves over 23% power savings for 8-bit operations compared to state-of-the-art (SotA) mixed-precision MAC designs. To save more power, an approximate computing mode for 8-bit multiplication is further designed, resulting in a MAC unit enabling eight operation modes, referred to as PS4MAC_AP. Finally, output-stationary systolic arrays (SAs) are explored using the above-mentioned MAC designs to implement DNNs operating under a 1 GHz clock. Our designs show the highest energy efficiency and outstanding area efficiency in all 8-bit, 4-bit, and 2-bit operation modes. Compared with the traditional SA with high-precision-split multipliers, PS4MAC_AP improves the energy efficiency for 8-bit operations by 0.6 TOPS/W, and PS4MAC achieves a 0.4-0.7 TOPS/W improvement for all operation modes.
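The precision-splitting principle behind PS4Ms can be seen in the unsigned case: an 8x8 product decomposes exactly into four 4x4 partial products that are shifted and added. The snippet below verifies this identity over all 8-bit operand pairs; the signed/unsigned mode handling and the fused accumulation of PS4MAC are considerably more involved than this sketch.

    def mul8_from_4x4(a, b):
        """Compose an unsigned 8x8 multiply from four 4x4 partial products."""
        a_hi, a_lo = a >> 4, a & 0xF
        b_hi, b_lo = b >> 4, b & 0xF
        return (a_lo * b_lo) + ((a_lo * b_hi + a_hi * b_lo) << 4) + ((a_hi * b_hi) << 8)

    # Exhaustive check of the decomposition identity.
    assert all(mul8_from_4x4(a, b) == a * b for a in range(256) for b in range(256))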
16:50 CEST TS15.5 FEDERATED REINFORCEMENT LEARNING FOR OPTIMIZING THE POWER EFFICIENCY OF EDGE DEVICES
Speaker:
Benedikt Dietrich, Karlsruhe Institute of Technology, DE
Authors:
Benedikt Dietrich1, Rasmus Müller-Both2, Heba Khdr3 and Joerg Henkel3
1Chair for Embedded Systems, Karlsruhe Institute of Technology, DE; 2-, DE; 3Karlsruhe Institute of Technology, DE
Abstract
Reinforcement learning (RL) holds great promise for adaptively optimizing microprocessor performance under power constraints. It allows for online learning of application characteristics at runtime and enables adjustment to varying system dynamics such as changes in the workload, user preferences or ambient conditions. However, online policy optimization remains resource-intensive, with high computational demand and requiring many samples to converge, making it challenging to deploy to edge devices. In this work, we overcome both of these obstacles and present federated power control using dynamic voltage and frequency scaling (DVFS). Our technique leverages federated RL and enables multiple independent power controllers running on separate devices to collaboratively train a shared DVFS policy, consolidating experience from a multitude of different applications, while ensuring that no privacy-sensitive information leaves the devices. This leads to faster convergence and to increased robustness of the learned policies. We show that our federated power control achieves 57% average performance improvements over a policy that is only trained on local data. Compared to a state-of-the-art collaborative power control, our technique leads to 22% better performance on average for the running applications under the same power constraint.
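The privacy-preserving aggregation step can be sketched with a FedAvg-style parameter average: each device trains its DVFS policy locally, and only the parameter tensors are combined on a server, so no runtime traces leave the devices. This is a generic sketch rather than the paper's exact protocol, and the parameter names are hypothetical.

    import numpy as np

    def federated_average(local_policies, weights=None):
        """Weighted average of per-device policy parameter dictionaries."""
        weights = weights or [1.0 / len(local_policies)] * len(local_policies)
        return {k: sum(w * p[k] for w, p in zip(weights, local_policies))
                for k in local_policies[0]}

    # Two devices with a hypothetical 2-parameter policy "w".
    global_policy = federated_average([{"w": np.array([1.0, 2.0])},
                                       {"w": np.array([3.0, 4.0])}])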
16:55 CEST TS15.6 AXON: A NOVEL SYSTOLIC ARRAY ARCHITECTURE FOR IMPROVED RUN TIME AND ENERGY EFFICIENT GEMM AND CONV OPERATION WITH ON-CHIP IM2COL
Speaker:
Md Mizanur Rahaman Nayan, Georgia Tech, US
Authors:
Md Mizanur Rahaman Nayan, Ritik Raj, Gouse Shaik Basha, Tushar Krishna and Azad J Naeemi, Georgia Tech, US
Abstract
General matrix multiplication (GeMM) is a core operation in virtually all AI applications. Systolic array (SA) based architectures have shown great promise as GeMM hardware accelerators thanks to their speed and energy efficiency. Unfortunately, SAs incur a linear delay in filling the operands, due to unidirectional propagation via pipeline latches. In this work, we propose a novel in-array data orchestration technique in SAs where we enable data feeding on the principal diagonal followed by bi-directional propagation. This improves the runtime by up to 2× at minimal hardware overhead. In addition, the proposed data orchestration enables convolution lowering (known as im2col) using simple hardware support to fully exploit input feature map reuse opportunities and significantly lower off-chip memory traffic, resulting in a 1.2× throughput improvement and a 2.17× inference energy reduction on YOLOv3 and ResNet50 workloads on average. In contrast, conventional data orchestration would require more elaborate hardware and control signals to implement im2col in hardware because of the data skew. We have synthesized and conducted place and route for 16×16 systolic arrays based on the novel and conventional orchestrations using the ASAP 7nm PDK and found that our proposed approach results in only 0.211% area and 1.6% power overheads.
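For reference, the im2col lowering that AXON supports on-chip is shown below as a plain software transform for a single-channel input: each kxk patch becomes one column, turning the convolution into a matrix multiplication. This is the textbook form of im2col, not the paper's hardware mechanism.

    import numpy as np

    def im2col(x, k):
        """Unfold every kxk patch of a 2-D input into one column."""
        h, w = x.shape
        out_h, out_w = h - k + 1, w - k + 1
        cols = np.empty((k * k, out_h * out_w))
        for i in range(out_h):
            for j in range(out_w):
                cols[:, i * out_w + j] = x[i:i + k, j:j + k].ravel()
        return cols

    x = np.arange(16.0).reshape(4, 4)
    kernel = np.ones((3, 3))
    y = (kernel.ravel() @ im2col(x, 3)).reshape(2, 2)   # equals the valid 3x3 convolution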
17:00 CEST TS15.7 TEMPUS CORE: AREA-POWER EFFICIENT TEMPORAL-UNARY CONVOLUTION CORE FOR LOW-PRECISION EDGE DLAS
Speaker:
Prabhu Vellaisamy, Carnegie Mellon University, US
Authors:
Prabhu Vellaisamy1, Harideep Nair1, Thomas Kang1, Yichen Ni1, Haoyang Fan1, Bin Qi1, Hsien-Fu Hung1, Jeff Chen1, Shawn Blanton1 and John Shen2
1Carnegie Mellon University, US; 2Carnegie Mellon University, US
Abstract
The increasing complexity of deep neural networks (DNNs) poses significant challenges for edge inference deployment due to resource and power constraints of edge devices. Recent works on unary-based matrix multiplication hardware aim to leverage data sparsity and low-precision values to enhance hardware efficiency. However, the adoption and integration of such unary hardware into commercial deep learning accelerators (DLA) remain limited due to processing element (PE) array dataflow differences. This work presents Tempus Core, a convolution core with a highly scalable unary-based PE array comprising tub (temporal-unary-binary) multipliers that seamlessly integrates with the NVDLA (NVIDIA's open-source DLA for accelerating CNNs) while maintaining dataflow compliance and boosting hardware efficiency. Analysis across various datapath granularities shows that for INT8 precision in 45nm CMOS, Tempus Core's PE cell unit (PCU) yields 59.3% and 15.3% reductions in area and power consumption, respectively, over NVDLA's CMAC unit. Considering a 16x16 PE array in Tempus Core, area and power improve by 75% and 62%, respectively, while delivering 5x and 4x iso-area throughput improvements for INT8 and INT4 precisions. Post-place and route analysis of Tempus Core's PCU shows that the 16x4 PE array for INT4 precision in 45nm CMOS requires only 0.017mm^2 die area and consumes only 6.2mW of total power. We demonstrate that area-power efficient unary-based hardware can be seamlessly integrated into conventional DLAs, paving the path for efficient unary hardware for edge AI inference.
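The behavioral idea of a temporal-unary-binary (tub) multiplier can be stated in a few lines: one operand arrives as a train of unary pulses over time, and the binary operand is accumulated once per pulse. This is only the functional principle under that assumption; the actual PE datapath, pulse generation, and timing are far more refined.

    def tub_multiply(unary_count, binary_value):
        """Temporal-unary times binary: accumulate the binary operand once per
        incoming unary pulse (behavioral sketch of the tub principle only)."""
        acc = 0
        for _ in range(unary_count):   # one pulse per time step
            acc += binary_value
        return acc

    assert tub_multiply(13, 7) == 13 * 7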
17:05 CEST TS15.8 ADAPTIVE MULTI-THRESHOLD ENCODING FOR ENERGY-EFFICIENT ECG CLASSIFICATION ARCHITECTURE USING SPIKING NEURAL NETWORK
Speaker:
Mohammad Amin Yaldagard, TU Delft, NL
Authors:
Sumit Diware, Yingzhou Dong, Mohammad Amin Yaldagard and Rajendra Bishnoi, TU Delft, NL
Abstract
Timely identification of cardiac arrhythmia (abnormal heartbeats) is vital for early diagnosis of cardiovascular diseases. Wearable healthcare devices facilitate this process by recording heartbeats through electrocardiogram (ECG) signals and using AI-driven hardware to classify them into arrhythmia classes. Spiking neural networks (SNNs) are well-suited for such hardware as they consume low energy due to event-driven operation. However, their energy-efficiency is constrained by encoding methods that translate real-valued ECG data into spikes. In this paper, we present an SNN-based ECG classification architecture featuring a new adaptive multi-threshold spike encoding scheme. This scheme adjusts encoding window and granularity based on the importance of ECG data samples, to capture essential information with fewer spikes. We develop a high-accuracy SNN model for such spike representation, by proposing a technique specifically tailored to our encoding. We design a hardware architecture for this model, which incorporates optimized layer post-processing for energy-efficient data-flow and employs fixed-point quantization for computational efficiency. Moreover, we integrate this architecture with our encoding scheme into a system-on-chip implementation using TSMC 40nm technology. Results show that our proposed approach achieves better energy-efficiency compared to state-of-the-art, with high ECG classification accuracy.
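One simplistic reading of multi-threshold spike encoding is a thermometer-style code: each sample is compared against a bank of thresholds, and a spike is emitted on every channel whose threshold is exceeded. The paper's scheme additionally adapts the encoding window and granularity to sample importance, which the sketch below does not model; the threshold values are hypothetical.

    import numpy as np

    def multi_threshold_encode(samples, thresholds):
        """Thermometer-style multi-threshold encoding: spike raster of shape
        [num_samples, num_thresholds] (simplified illustration only)."""
        return (samples[:, None] >= thresholds[None, :]).astype(np.uint8)

    raster = multi_threshold_encode(np.array([0.1, 0.45, 0.9]),
                                    thresholds=np.array([0.2, 0.4, 0.6, 0.8]))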
17:10 CEST TS15.9 LOWGRADQ: ADAPTIVE GRADIENT QUANTIZATION FOR LOW-BIT CNN TRAINING VIA KERNEL DENSITY ESTIMATION-GUIDED THRESHOLDING AND HARDWARE-EFFICIENT STOCHASTIC ROUNDING UNIT
Speaker:
Sangbeom Jeong, Seoul National University of Science and Technology, KR
Authors:
Sangbeom Jeong1, Seungil Lee1 and Hyun Kim2
1Seoul National University of Science and Technology, Department of Electrical and Information Engineering, KR; 2Seoul National University of Science and Technology, KR
Abstract
This paper proposes a hardware-efficient INT8 training framework with dual-scale adaptive gradient quantization (DAGQ) to cope with the growing need for efficient on-device CNN training. DAGQ captures both small- and large-magnitude gradients, ensuring robust low-bit training with minimal quantization error. Additionally, to reduce the computational and memory demands of stochastic rounding in low-bit training, we introduce a reusable LFSR-based stochastic rounding unit (RLSRU), which efficiently generates and reuses random numbers, minimizing hardware complexity. The proposed framework achieves stable INT8 training across various networks with minimal accuracy loss while being implementable on RTL-based hardware accelerators, making it well-suited for resource-constrained environments.
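The two ingredients named in the abstract, stochastic rounding and an LFSR as the pseudo-random source, can be sketched as follows. The tap positions and the way random numbers are generated and reused in the actual RLSRU are design choices of the paper; this is only a generic software illustration.

    def lfsr16(state):
        """One step of a 16-bit Fibonacci LFSR (taps 16, 14, 13, 11); seed must be nonzero."""
        bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        return (state >> 1) | (bit << 15)

    def stochastic_round(x, state):
        """Round x up with probability equal to its fractional part, using the
        LFSR output as the random source; returns (rounded value, new state)."""
        state = lfsr16(state)
        floor_x = int(x // 1)
        frac = x - floor_x
        return floor_x + (1 if state / 65536.0 < frac else 0), state

    value, state = stochastic_round(2.3, state=0xACE1)   # rounds to 3 with ~30% probability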
17:11 CEST TS15.10 PFASWARE: QUANTIFYING THE ENVIRONMENTAL IMPACT OF PER- AND POLYFLUOROALKYL SUBSTANCES (PFAS) IN COMPUTING SYSTEMS
Speaker:
Mariam Elgamal, Harvard University, US
Authors:
Mariam Elgamal1, Abdulrahman Mahmoud2, Gu-Yeon Wei1, David Brooks1 and Gage Hills1
1Harvard University, US; 2Mohamed bin Zayed University of Artificial Intelligence, AE
Abstract
PFAS (per- and poly-fluoroalkyl substances), also known as forever chemicals, are widely used in electronics and semiconductor manufacturing. PFAS are environmentally persistent and bioaccumulative synthetic chemicals, which have recently received considerable regulatory attention. Manufacturing semiconductors and electronics, including integrated circuits (IC), batteries, displays, etc., currently accounts for a staggering 10% of the total PFAS-containing fluoropolymers used in Europe alone. Now, computer system designers have an opportunity to reduce the use of PFAS in semiconductors and electronics at the design phase. In this work, we quantify the environmental impact of PFAS in computing systems, and outline how designers can optimize their designs to use less PFAS. We show that manufacturing an IC design at a 7 nm technology node using Extreme Ultraviolet (EUV) lithography uses 20% less volume of PFAS-containing chemicals versus manufacturing the same design at a 7 nm node using Deep Ultraviolet (DUV) immersion lithography (instead of EUV). We also show that manufacturing an IC design at a 16 nm technology node results in 15% less volume of PFAS than manufacturing the same design at a 28 nm node due to its smaller area.
17:12 CEST TS15.11 FAST MACHINE LEARNING BASED PREDICTION FOR TEMPERATURE SIMULATION USING COMPACT MODELS
Speaker:
Ayse Coskun, Boston University, US
Authors:
Mohammadamin Hajikhodaverdian1, Sherief Reda2 and Ayse Coskun1
1Boston University, US; 2Brown University, US
Abstract
As transistor densities increase, managing thermal challenges in 3D IC designs becomes more complex. Traditional methods like finite element methods and compact thermal models (CTMs) are computationally expensive, while existing machine learning (ML) models require large datasets and a long training time. To address these challenges with the ML models, we introduce a novel ML framework that integrates with CTMs to accelerate steady-state thermal simulations without needing large datasets. Our approach achieves up to 70× speedup over state-of-the-art simulators, enabling real-time, high-resolution thermal simulations for 2D and 3D IC designs.
17:13 CEST TS15.12 CPP-SGS: CYCLE-ACCURATE POWER PREDICTION FRAMEWORK VIA SNN AND GENETIC SIGNAL SELECTION
Speaker:
Tong Liu, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Tong Liu1, Zijun Jiang2 and Yangdi Lyu1
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2Hong Kong University of Science & Technology (Guangzhou), CN
Abstract
Effective power management is crucial for optimizing the performance and longevity of integrated circuits. Cycle-accurate power prediction can help power management during runtime. This paper introduces a Cycle-accurate Power Prediction framework via Spiking neural networks (SNNs) and Genetic signal Selection (CPP-SGS), which integrates SNNs and Genetic Algorithms (GAs) to predict real-time power consumption of chips. We apply GAs to select the most relevant signals as the input to SNNs to reduce the model size and inference time, making it well-suited for dynamic power estimation in real-time scenarios. The experimental results show that CPP-SGS outperforms the state-of-the-art approaches, with a normalized root mean squared error (NRMSE) of less than 1.6%.

TS16 Design, Test, Modeling and Mitigation of defects and faults

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS16.1 FUSIS: FUSING SURROGATE MODELS AND IMPORTANCE SAMPLING FOR EFFICIENT YIELD ESTIMATION
Speaker:
Wei Xing, The University of Sheffield, GB
Authors:
Yanfang Liu1 and Wei Xing2
1Beihang University, CN; 2The University of Sheffield, GB
Abstract
As process nodes continue to shrink, yield estimation has become increasingly critical in modern circuit design. Traditional approaches face significant challenges: surrogate-based methods often struggle with robustness and accuracy, whereas importance sampling (IS)-based methods suffer from high simulation costs. To address these challenges simultaneously, we propose FUSIS, a unified framework that combines the strengths of surrogate-based and IS-based approaches. Unlike conventional surrogate-based methods that directly replace SPICE simulations for performance predictions, FUSIS employs a Deep Kernel support vector machine (SVM) as an approximation of the indicator function, which is further utilized to construct a quasi-optimal proposal distribution for IS to accelerate convergence. To further mitigate yield estimation bias caused by surrogate inaccuracies, we introduce a novel correction factor to adjust the IS-based yield estimation. Experiments conducted on SRAM and analog circuits demonstrate that FUSIS significantly improves accuracy by up to 24.84% (8.67% on average) while achieving up to 29.54x (10.30x on average) speedup in efficiency compared to seven state-of-the-art methods.
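The importance-sampling estimator at the heart of such methods is P_fail ~ (1/n) sum_i I(fail(x_i)) p(x_i)/q(x_i) with x_i drawn from the proposal q, and yield = 1 - P_fail. In FUSIS a surrogate classifier shapes q toward the failure region and a correction factor compensates for surrogate error; the sketch below shows only the generic estimator with user-supplied placeholder callables.

    import numpy as np

    def normal_pdf(x, mu, sigma=1.0):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    def is_yield_estimate(sample_proposal, proposal_pdf, nominal_pdf, fails, n=100_000, seed=0):
        """Importance-sampling yield estimate: weight each failure indicator by p(x)/q(x)."""
        rng = np.random.default_rng(seed)
        xs = sample_proposal(n, rng)
        w = nominal_pdf(xs) / proposal_pdf(xs)
        return 1.0 - np.mean(fails(xs) * w)

    # Toy 1-D example: nominal N(0,1), failure when x > 3, proposal shifted to N(3,1).
    y = is_yield_estimate(
        sample_proposal=lambda n, rng: rng.normal(3.0, 1.0, n),
        proposal_pdf=lambda x: normal_pdf(x, 3.0),
        nominal_pdf=lambda x: normal_pdf(x, 0.0),
        fails=lambda x: (x > 3.0).astype(float))          # y is close to 0.99865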
16:35 CEST TS16.2 ROTA: ROTATIONAL TORUS ACCELERATOR FOR WEAR LEVELING OF NEURAL PROCESSING ELEMENTS
Speaker:
Taesoo Lim, Yonsei University, KR
Authors:
Taesoo Lim, Hyeonjin Kim, Jingu Park, Bogil Kim and William Song, Yonsei University, KR
Abstract
This paper introduces a reliability-aware neural accelerator design with a wear-leveling solution that balances the utilization of processing elements (PEs). Neural accelerators deploy many PEs to exploit data-level parallelism, but their designs and operations have focused mostly on performance and energy efficiency metrics. Directional dataflows in PE arrays and dimensional misalignment with variable-sized neural layers cause the underutilization of PEs, which is biased to PE locations and gradually accumulated over time. Consequently, the accelerators experience severe usage imbalance between PEs. To resolve the problem, this paper proposes a rotational torus accelerator (RoTA) with an optimized wear-leveling scheme that shuffles PE utilization spaces to eliminate PE usage imbalance. Evaluation results show that RoTA improves lifetime reliability by 1.69x.
16:40 CEST TS16.3 LOCATION IS ALL YOU NEED: EFFICIENT LITHOGRAPHIC HOTSPOT DETECTION USING ONLY POLYGON LOCATIONS
Speaker:
Kang Liu, Huazhong University of Science and Technology, CN
Authors:
Yujia Wang1, Jiaxing Wang1, Dan Feng1, Yuzhe Ma2 and Kang Liu1
1Huazhong University of Science and Technology, CN; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
With integrated circuits at advanced technology nodes shrinking in feature size, lithographic hotspot detection has become increasingly important. Deep learning, especially convolutional neural networks (CNNs) and graph neural networks (GNNs), has recently succeeded in lithographic hotspot detection, where layout patterns, represented as images or graph features, are classified into hotspots and non-hotspots. However, with increasingly sophisticated CNN architectural designs, CNN-based hotspot detection requires excessive training and inference costs with expanding model sizes but only marginally improves detection accuracy. Existing GNN-based hotspot detectors require a more intuitive and efficient layout graph feature representation. Driven by the understanding that lithographic hotspots result from complex interactions among metal polygons through the light system, we propose that the absolute and relative locations of metal polygons are all we need to detect hotspots of a layout clip. We propose a novel layout graph feature representation for hotspot detection where the coordinates of each polygon and the distances between them are taken as node and edge features, respectively. We design an advanced GNN architecture using graph attention and different feature update functions for different edge types of polygons. Our experimental results demonstrate that our architecture achieves the highest hotspot accuracy and the lowest false alarm rate on different datasets. Notably, we employ one-third of the graph features of the previous GNN hotspot detector and achieve higher accuracy. We outperform all CNN hotspot detectors with higher accuracy, up to a 32x speedup in inference time, and a 64x reduction in model size.
16:45 CEST TS16.4 EFFICIENT MODULATED STATE SPACE MODEL FOR MIXED-TYPE WAFER DEFECT PATTERN RECOGNITION
Speaker:
Mu Nie, Anhui Polytechnic University, CN
Authors:
Mu Nie1, ShiDong Zhu1, Aibin Yan2, Zhuo Chen3, Xiaoqing Wen4 and Tianming Ni1
1Anhui Polytechnic University, CN; 2Hefei University of Technology, CN; 3Zhejiang University, CN; 4Kyushu Institute of Technology, JP
Abstract
Accurate and efficient wafer defect detection is crucial in semiconductor manufacturing to maintain product quality and optimize yield. Traditional methods struggle with the complexity and diversity of modern wafer defect patterns. While deep learning approaches are effective, they are often resource-intensive, posing challenges for real-time deployment in industrial settings. To solve these problems, we propose an Efficient Modulated State Space Model (EM-SSM) for mixed-type wafer defect recognition, optimized with knowledge distillation to balance accuracy and efficiency. Our framework captures size-dependent relationships and improves defect-specific feature representation to recognize complex defects precisely. Specifically, we introduce an efficient directional modulation mechanism to refine spatial recognition of defect patterns. To further improve inference efficiency, we propose a deep-to-shallow distillation method that transfers knowledge from deeper networks to lighter networks, reducing inference time without compromising classification accuracy. Experimental results on the MixedWM38 wafer dataset with 38 defect types show that our model achieves 99.0% accuracy, outperforming traditional methods in both accuracy and efficiency. Our model offers a scalable solution for modern semiconductor defect detection.
16:50 CEST TS16.5 MORE-STRESS: MODEL ORDER REDUCTION BASED EFFICIENT NUMERICAL ALGORITHM FOR THERMAL STRESS SIMULATION OF TSV ARRAYS IN 2.5D/3D IC
Speaker:
Tianxiang Zhu, Peking University, CN
Authors:
Tianxiang Zhu, Qipan Wang, Yibo Lin, Runsheng Wang and Ru Huang, Peking University, CN
Abstract
Thermomechanical stress induced by through-silicon vias (TSVs) plays an important role in the performance and reliability analysis of 2.5D/3D ICs. While the finite element method (FEM) adopted by commercial software can provide accurate simulation results, it is very time- and memory-consuming for large-scale analysis. Over the past decade, the linear superposition method has been utilized to perform fast thermal stress estimations of TSV arrays, but it suffers from a lack of accuracy. In this paper, we propose MORE-Stress, a novel strict numerical algorithm for efficient thermal stress simulation of TSV arrays based on model order reduction. Experimental results demonstrate that our algorithm can realize a 153-504x reduction in simulation time and a 39-115x reduction in memory usage compared with the commercial software ANSYS, with negligible errors less than 1%. Our algorithm is as efficient as the linear superposition method, with an order of magnitude smaller errors and fast convergence.
16:55 CEST TS16.6 DYNAMIC IR-DROP PREDICTION THROUGH A MULTI-TASK U-NET WITH PACKAGE EFFECT CONSIDERATION
Speaker:
Yu-Hsuan Chen, National Tsing Hua University, Taiwan, TW
Authors:
Yu-Hsuan Chen1, Yu-Chen Cheng1, Yong-Fong Chang2, Yu-Che Lee1, Jia-Wei Lin2, Hsun-Wei Pao2, Peng-Wen Chen2, Po-Yu Chen2, Hao-Yun Chen2, Yung-Chih Chen3, Chun-Yao Wang1 and Shih-Chieh Chang1
1National Tsing Hua University, TW; 2Mediatek Inc, Taiwan, TW; 3National Taiwan University of Science and Technology, TW
Abstract
Dynamic IR drop analysis is a critical step in the design signoff stage for verifying the power integrity of a chip. Since the analysis is extremely time-consuming, it has led to the emergence of machine learning (ML)-based methods to expedite the procedure. While previous ML approaches have demonstrated the feasibility of IR drop prediction, they often neglect package effects and do not address diverse IR criteria for memory and standard cells. Thus, this paper introduces a novel ML-based approach designed for a fast and accurate prediction of multi-type IR drop, considering package effects. We develop new package-related features to account for the package impact on IR drop. The proposed model is based on a multi-task U-net architecture that not only predicts two types of IR drops simultaneously but also increases prediction accuracy through comprehensive learning. To further enhance the model performance, we introduce the Input Fusion Block (IFB), which unifies units across channels within the input feature maps, leading to improved prediction accuracy. The experimental results show the across-pattern transferability of the proposed IR drop prediction method, demonstrating an RMSE of less than 5mV and an MAE of less than 2mV on the unseen simulation patterns. Additionally, our proposed method achieves a 5X speed-up compared to the commercial tool.
17:00 CEST TS16.7 MINIMUM TIME MAXIMUM FAULT COVERAGE TESTING OF SPIKING NEURAL NETWORKS
Speaker:
Spyridon Raptis, Sorbonne Université, CNRS, LIP6, FR
Authors:
Spyridon Raptis1 and Haralampos-G. Stratigopoulos2
1Sorbonne Université, CNRS, LIP6, FR; 2Sorbonne University, CNRS, LIP6, FR
Abstract
We present a novel test generation algorithm for hardware accelerators of Spiking Neural Networks (SNNs). The algorithm is based on advanced optimization tailored for the spiking domain. It adaptively crafts input samples towards high coverage of hardware-level faults. Time-consuming fault simulation during test generation is circumvented by defining loss functions targeting the maximization of fault sensitisation and fault effect propagation to the output. Comparing the proposed algorithm to the existing ones on three benchmarks, it scales up for large SNN models, and it drastically reduces the test generation runtime from days to hours and the test duration from minutes to seconds. The resultant test input shows near perfect fault coverage and has a duration equivalent to a few dataset samples, thus, besides post-manufacturing testing, it is also suited for in-field testing.
17:05 CEST TS16.8 EGIS: ENTROPY GUIDED IMAGE SYNTHESIS FOR DATASET-AGNOSTIC TESTING OF RRAM-BASED DNNS
Speaker:
Anurup Saha, Georgia Tech, US
Authors:
Anurup Saha, Chandramouli Amarnath, Kwondo Ma and Abhijit Chatterjee, Georgia Tech, US
Abstract
While resistive random access memory (RRAM) based deep neural networks (DNN) are important for low-power inference in IoT and edge applications, they are vulnerable to the effects of manufacturing process variations that degrade their performance (classification accuracy). However, to test the same post-manufacture, the (image) dataset used to train the associated machine learning applications may not be available to the RRAM crossbar manufacturer for privacy reasons. As such, the performance of DNNs needs to be assessed with carefully crafted dataset-agnostic synthetic test images that expose anomalies in the crossbar manufacturing process to the maximum extent possible. In this work, we propose a dataset-agnostic post-manufacture testing framework for RRAM-based DNNs using Entropy Guided Image Synthesis (EGIS). We first create a synthetic image dataset such that the DNN outputs corresponding to the synthetic images minimize an entropy-based loss metric. Next, a small subset (consisting of 10-20 images) of the synthetic image dataset, called the compact image dataset, is created to expedite testing. The response of the device under test (DUT) to the compact image dataset is passed to a machine learning based outlier detector for pass/fail labeling of the DUT. It is seen that the test accuracy using such synthetic test images is very close to that of contemporary test methods.
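The entropy-based loss mentioned above can be written down directly: for a candidate synthetic image, take the DNN's softmax output and compute its Shannon entropy, then optimize the image against this loss. The sketch below shows only the loss term; the synthesis loop, compact-set selection, and the outlier detector are not modeled.

    import numpy as np

    def output_entropy(logits):
        """Shannon entropy of the softmax distribution over class logits."""
        z = logits - np.max(logits)
        p = np.exp(z) / np.sum(np.exp(z))
        return -np.sum(p * np.log(p + 1e-12))

    print(output_entropy(np.array([4.0, 0.1, -2.0])))   # low entropy: a confident output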
17:10 CEST TS16.9 NVSRLO: A FEFET-BASED NON-VOLATILE AND SEU-RECOVERABLE LATCH DESIGN WITH OPTIMIZED OVERHEAD
Speaker:
Wangjin Jiang, Hefei University of Technology, CN
Authors:
Aibin Yan1, Wangjin Jiang1, Han Bao1, Zhengfeng Huang1, Tianming Ni2, Xiaoqing Wen3 and Patrick Girard4
1Hefei University of Technology, CN; 2Anhui Polytechnic University, CN; 3Kyushu Institute of Technology, JP; 4LIRMM, FR
Abstract
This paper presents a FeFET-based non-volatile and single-event upset (SEU) recoverable latch, namely NVSRLO, which does not require any extra control signals. Simulation results show that the proposed latch provides non-volatility and SEU-recovery with optimized overhead. Compared with existing non-volatile latches, NVSRLO significantly reduces delay, power, and delay-power-area product at the cost of area.
17:11 CEST TS16.10 INTERA-ECC: INTERCONNECT-AWARE ERROR CORRECTION IN STT-MRAM
Speaker:
Surendra Hemaram, Karlsruhe Institute of Technology, DE
Authors:
Surendra Hemaram1, Mahta Mayahinia1, Mehdi Tahoori1, Francky Catthoor2, Siddharth Rao2, Sebastien Couet2, Tommaso Marinelli3, Anita Farokhnejad2 and Gouri Kar2
1Karlsruhe Institute of Technology, DE; 2IMEC, BE; 3imec, BE
Abstract
Spin-transfer torque magnetic random access memory (STT-MRAM) is a promising alternative to existing memory technologies. However, STT-MRAM faces reliability challenges, primarily due to stochastic switching, process variation, and manufacturing defects. These reliability challenges become even worse due to interconnect parasitic resistive-capacitive effects, potentially compromising the reliability of memory cells located far from the write driver. This can severely impair the manufacturing yield and large-scale industrial adoption. To address this, we propose an interconnect-aware error correction coding (InterA-ECC), which provides non-uniform error correction to a different zone of the memory subarray. The proposed InterA-ECC strategy selectively applies robust error-correction code (ECC) to specific rows within the subarray rather than uniformly across all rows, reducing ECC parity bits while enhancing bit error rate resiliency in the most vulnerable memory zone.
17:12 CEST TS16.11 ASSESSING SOFT ERROR RELIABILITY IN VECTORIZED KERNELS: VULNERABILITY AND PERFORMANCE TRADE-OFFS ON ARM AND RISC-V ISAS
Speaker and Author:
Geancarlo Abich, UFRGS, BR
Abstract
The demand for advanced processing capabilities is paramount in the ever-evolving landscape of radiation-resilient computing exploration. With the standardization of vector extensions on Arm and RISC-V ISAs, leading technology companies are adopting high-performance processors to exploit vector capabilities. In this regard, this work proposes an automated register cross-section reliability evaluation, extending uniform random register-file fault injection to assess the increased vulnerability that comes with the vector register length. The technique enables soft error reliability assessment of the RISC-V and Arm vector extensions and comparison with their scalar counterparts over different integer and FP precisions. The obtained results show that soft error criticality correlates with the registers' cross-section, and the vectorized benchmarks exhibited error susceptibility of up to 78%, compared with 6% for the scalar versions, varying with precision. This emphasizes the necessity of balancing performance and reliability in emerging onboard platforms with vector capabilities.
17:13 CEST TS16.12 EARLY FUNCTIONAL SAFETY AND PPA EVALUATION OF DIGITAL DESIGNS
Speaker:
Michelangelo Bartolomucci, Politecnico di Torino, IT
Authors:
Michelangelo Bartolomucci1, David Kingston2, Teo Cupaiuolo3, Alessandra Nardi4 and Riccardo Cantoro1
1Politecnico di Torino, IT; 2Synopsys, GB; 3Synopsys, IT; 4Synopsys, US
Abstract
The use of semiconductor devices in safety-critical scenarios is increasing in both quantity and complexity. This paper presents a novel approach to support safety requirements from RTL exploration through to implementation, with the aid of a Safety Specification Format (SSF), thereby minimizing costly development iterations and reducing the Time-To-Market. An assessment of the results is given for the CV32E40P open source RISC-V processor.

DP DATE Party

Add this session to my calendar

Date: Tuesday, 01 April 2025
Time: 19:30 CEST - 23:00 CEST


Wednesday, 02 April 2025

ES Executive Session

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST


FS07 Focus Session - European Startups on AI: Path to Success

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST

Session chair:
Anton Klotz, Fraunhofer, DE

Session co-chair:
Marco Inglardi, Synopsys, IT

AI is one of the hottest and most influential topics of this decade. Specialized hardware is required to run AI algorithms, and several companies, among them a number of European startups, are working on designing such hardware. These startups must overcome multiple challenges: a lack of financing, a shortage of skilled workforce, and the difficulty of finding customers willing to take the risk of working with a startup rather than an established market leader. In this session, several startups take the floor and explain how they have managed to overcome these challenges. We also hear the perspectives of a commercial startup incubator specialized in microelectronics startups and of an academic who has spun off several startups. After the impulse presentations, there will be a panel discussion in which the panelists answer questions from the audience on the landscape of microelectronics startups in Europe.

Participants:
Manu Nair, Synthara, CH
Patrick Couvert, NEUrXCORE, FR
Sean Redmond, Silicon Catalyst, UK
David Atienza, EPFL, CH
Edith Euan Diaz, Axelera, NL


LBR01 Late Breaking Results

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST


SD03 Special Day on Emerging Computing Paradigms

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST

Session chair:
John Paul Strachan, Forschungszentrum Juelich GmbH, DE

Time Label Presentation Title
Authors
08:30 CEST SD03.1 OPENING AND INTRODUCTION TO THE SPECIAL DAY
Presenter:
John Paul Strachan, Forschungszentrum Juelich GmbH, DE
Author:
John Paul Strachan, Forschungszentrum Juelich GmbH, DE
Abstract
.
08:53 CEST SD03.2 ERROR PROPAGATION THROUGH SPACE, TIME AND THE BRAIN
Presenter:
Mihai Petrovici, University of Bern, CH
Author:
Mihai Petrovici, University of Bern, CH
Abstract
.
09:15 CEST SD03.3 SPINTRONIC NEURAL NETWORKS
Presenter:
Julie Grollier, CNRS/Thales, FR
Author:
Julie Grollier, CNRS/Thales, FR
Abstract
.
09:38 CEST SD03.4 SILICON PHOTONICS FOR AI - THE GOOD, THE BAD AND THE UGLY
Presenter:
Thomas Van Vaerenbergh, Hewlett Packard Labs, BE
Author:
Thomas Van Vaerenbergh, Hewlett Packard Labs, BE
Abstract
.

SoCL SoC Labs: The academic community for System on Chip Development

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 12:30 CEST

8:30-8:40 Welcome
8:40-10:20 Presentation of selected teams of the "Understanding our world" research and education design contest
10:20-12:00 Education: How to build SoC project throughout all design phases until tape out
12:00-12:30 Presentation of the 2026 SoC Labs contest


TS17 System simulation and validation

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST TS17.1 FLOPPYFLOAT: AN OPEN SOURCE FLOATING POINT LIBRARY FOR INSTRUCTION SET SIMULATORS
Speaker:
Niko Zurstraßen, RWTH Aachen University, DE
Authors:
Niko Zurstraßen, Nils Bosbach and Rainer Leupers, RWTH Aachen University, DE
Abstract
Instruction Set Simulators (ISSs) are important software tools that facilitate the simulation of arbitrary compute systems. One of the most challenging aspects of ISS development is the modeling of Floating Point (FP) arithmetic. Despite an industry standard specifically created to avoid fragmentation, every Instruction Set Architecture (ISA) comes with an individual definition of FP arithmetic. Hence, many simulators, such as gem5 or Spike, do not use the Floating Point Unit (FPU) of the host system, but resort to soft float libraries. These libraries offer great flexibility and portability by calculating FP instructions by means of integer arithmetic. However, using tens or hundreds of integer instructions to model a single FP instruction is detrimental to the simulator's performance. Tackling the poor performance of soft float libraries, we present FloppyFloat - an open-source FP library for ISSs. FloppyFloat leverages the host FPU for basic calculations and rectifies corner cases in software. In comparison to the popular Berkeley SoftFloat, FloppyFloat achieves speedups of up to 5.5x for individual instructions. As a replacement for SoftFloat in the RISC-V golden reference simulator Spike, FloppyFloat accelerates common FP benchmarks by up to 1.41x.
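A toy example of the "host FPU plus software fix-ups" approach, using one well-known corner case: RISC-V requires arithmetic to return the canonical quiet NaN, whereas host FPUs such as x86 propagate input NaN payloads. The function name is hypothetical and not part of the FloppyFloat API; the real library rectifies many more cases (exception flags, rounding modes, etc.).

    import math
    import struct

    RISCV_CANONICAL_QNAN = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]

    def riscv_fadd_d(a, b):
        """Double-precision add on the host FPU, then canonicalize NaN results
        in software as the RISC-V ISA requires (illustrative sketch only)."""
        r = a + b
        return RISCV_CANONICAL_QNAN if math.isnan(r) else r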
08:35 CEST TS17.2 HANDLING LATCH LOOPS IN TIMING ANALYSIS WITH IMPROVED COMPLEXITY AND DIVERGENT LOOP DETECTION
Speaker:
Xizhe Shi, Peking University, CN
Authors:
Xizhe Shi, Zizheng Guo, Yibo Lin, Runsheng Wang and Ru Huang, Peking University, CN
Abstract
Latch loops introduce feedback cycles in timing graphs for static timing analysis (STA), disrupting timing propagation in topological order. Existing timers handle latch loops by checking the convergence of global iterations in timing propagation without lookahead detection of divergent loops. Such a strategy ends up with the worst-case runtime complexity O(n²), where n is the number of pins in the timing graph. This can be extremely time-consuming, when n goes to millions and beyond. In this paper, we address this challenge by proposing a new algorithm consisting of two steps. First, we identify the strongly connected components (SCCs) and levelize them into different stages. Second, we implement parallelized arrival time (AT) propagation between SCCs while conducting sequential iterations inside each SCC. This strategy significantly reduces the runtime complexity to O(∑(k_i)²) from the previous global propagation, where k_i is the number of pins in each SCC. Our timer also detects timing information divergent loops in advance, avoiding over-iteration. Experimental results on industrial designs demonstrate 10.31× and 8.77× speed-up over PrimeTime and OpenSTA on average, respectively.
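The two-step structure described above (collapse latch loops into strongly connected components, then propagate between SCCs in topological order while iterating only inside each SCC) can be sketched with networkx. The edge list below is a hypothetical timing graph, and the arrival-time iteration inside each SCC is omitted.

    import networkx as nx

    def levelize_timing_graph(edges):
        """Condense SCCs (latch loops) into single nodes and levelize the
        resulting DAG so SCCs can be processed in topological order."""
        g = nx.DiGraph(edges)
        cond = nx.condensation(g)             # DAG whose nodes are SCCs
        level = {}
        for scc in nx.topological_sort(cond):
            preds = list(cond.predecessors(scc))
            level[scc] = 1 + max((level[p] for p in preds), default=0)
        return cond, level                    # cond.nodes[scc]["members"] lists the pins

    cond, lvl = levelize_timing_graph([("a", "b"), ("b", "c"), ("c", "b"), ("c", "d")])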
08:40 CEST TS17.3 STATIC GLOBAL REGISTER ALLOCATION FOR DYNAMIC BINARY TRANSLATORS
Speaker:
Niko Zurstraßen, RWTH Aachen University, DE
Authors:
Niko Zurstraßen, Nils Bosbach, Lennart Reimann and Rainer Leupers, RWTH Aachen University, DE
Abstract
Dynamic Binary Translators (DBTs) facilitate the execution of binaries across different Instruction Set Architectures (ISAs). Similar to a just-in-time compiler, they recompile machine code from one ISA to another, and subsequently execute the generated code. To achieve near-native execution speed, several challenges must be overcome. This includes the problem of register allocation (RA). In classical compiler engineering, RA is often performed by global methods. However, due to the nature of DBTs, established global methods like graph coloring or linear scan are hardly applicable. This is why state-of-the-art DBTs, like QEMU, use basic-block-local methods, which come with several disadvantages. Addressing these flaws, we propose a novel global method based on static target-to-host mappings. As most applications only work on a small set of registers, mapping them statically from target to host significantly reduces load/store overhead. In a case study using our RISC-V-on-ARM64 user-mode simulator RISE SIM, we demonstrate speedups of up to 1.4× compared to basic-block-local methods.
08:45 CEST TS17.4 CORRECTBENCH: AUTOMATIC TESTBENCH GENERATION WITH FUNCTIONAL SELF-CORRECTION USING LLMS FOR HDL DESIGN
Speaker:
Ruidi Qiu, TU Munich, DE
Authors:
Ruidi Qiu1, Grace Li Zhang2, Rolf Drechsler3, Ulf Schlichtmann1 and Bing Li4
1TU Munich, DE; 2TU Darmstadt, DE; 3University of Bremen | DFKI, DE; 4University of Siegen, DE
Abstract
Functional simulation is an essential step in digital hardware design. Recently, there has been a growing interest in leveraging Large Language Models (LLMs) for hardware testbench generation tasks. However, the inherent instability associated with LLMs often leads to functional errors in the generated testbenches. Previous methods do not incorporate automatic functional correction mechanisms without human intervention and still suffer from low success rates, especially for sequential tasks. To address this issue, we propose CorrectBench, an automatic testbench generation framework with functional self-validation and self-correction. Utilizing only the RTL specification in natural language, the proposed approach can validate the correctness of the generated testbenches with a success rate of 88.85%. Furthermore, the proposed LLM-based corrector employs bug information obtained during the self-validation process to perform functional self-correction on the generated testbenches. The comparative analysis demonstrates that our method achieves a pass ratio of 70.13% across all evaluated tasks, compared with the previous LLM-based testbench generation framework's 52.18% and a direct LLM-based generation method's 33.33%. Specifically, in sequential circuits, our work's performance is 62.18% higher than previous work, and almost 5 times the pass ratio of the direct method. The codes and experimental results are open-sourced at the link: https://anonymous.4open.science/r/CorrectBench-8CEA.
08:50 CEST TS17.5 CISGRAPH: A CONTRIBUTION-DRIVEN ACCELERATOR FOR PAIRWISE STREAMING GRAPH ANALYTICS
Speaker:
Songyu Feng, Institute of Computing Technology, Chinese Academy of Sciences, CN
Authors:
Songyu Feng1, Mo Zou2 and Tian Zhi2
1Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
Recent research observed that pairwise query is practical enough in real-world streaming graph analytics. Given a pair of distinct vertices, existing approaches coalesce or prune vertex activations to decrease computations. However, they still suffer from severe invalid computations because they ignore contribution variations in graph updates, hindering performance improvement. In this work, we propose to enhance pairwise analytics by taking update contributions into account. We first identify that graph updates from one batch have distinct impacts on query results and incur markedly different computation overheads. We then introduce CISGraph, a novel Contribution-driven pairwise accelerator with valuable updates Identification and Scheduling. Specifically, inspired by the triangle inequality, CISGraph categorizes graph updates into three levels according to their contributions, prioritizes valuable updates, delays possibly valuable updates, and drops useless updates to eliminate wasteful computations. As far as we know, CISGraph is the first hardware accelerator that supports efficient pairwise queries on streaming graphs. Experimental results show that CISGraph substantially outperforms state-of-the-art streaming graph processing systems by 25× on average in response time.
08:55 CEST TS17.6 HIGH-PERFORMANCE ARM-ON-ARM VIRTUALIZATION FOR MULTICORE SYSTEMC-TLM-BASED VIRTUAL PLATFORMS
Speaker:
Nils Bosbach, RWTH Aachen University, DE
Authors:
Nils Bosbach1, Rebecca Pelke1, Niko Zurstraßen1, Jan Weinstock2, Lukas Jünger2 and Rainer Leupers1
1RWTH Aachen University, DE; 2MachineWare GmbH, DE
Abstract
The increasing complexity of hardware and software requires advanced development and test methodologies for modern systems on chips. This paper presents a novel approach to ARM-on-ARM virtualization within SystemC-based simulators using Linux's KVM to achieve high-performance simulation. By running target software natively on ARM-based hosts with hardware-based virtualization extensions, our method eliminates the need for instruction-set simulators, which significantly improves performance. We present a multicore SystemC-TLM-based CPU model that can be used as a drop-in replacement for an instruction-set simulator. It places no special requirements on the host system, making it compatible with various environments. Benchmark results show that our ARM-on-ARM-based virtual platform achieves up to 10× speedup over traditional instruction-set-simulator-based models on compute-intensive workloads. Depending on the benchmark, speedups increase to more than 100×.
09:00 CEST TS17.7 RTHETER: SIMULATING REAL-TIME SCHEDULING OF MULTIPLE TASKS IN HETEROGENEOUS ARCHITECTURES
Speaker:
Yinchen Ni, Shanghai Jiao Tong University, CN
Authors:
Yinchen Ni1, Jiace Zhu1, Yier Jin2 and An Zou1
1Shanghai Jiao Tong University, CN; 2University of Science and Technology of China, CN
Abstract
The rising popularity of AI applications is driving the adoption of heterogeneous computing architectures to handle complex computations. However, as these heterogeneous architectures grow more complex, optimizing the scheduling of multiple tasks and meeting strict timing constraints becomes significantly challenging. Current studies on real-time scheduling on heterogeneous processors lack agile and flexible simulation tools that can quickly adapt to varying system settings, leading to inefficiencies in system design. Additionally, the high costs associated with evaluating real-time performance in terms of human and facility effort further complicate the development process. To address these challenges, this paper introduces a comprehensive hierarchical simulation approach and a corresponding simulator designed for flexible heterogeneous computing platforms. The simulator supports ideal or practical, off-the-shelf or customizable heterogeneous architectures, upon which it can execute both parallel and dependent tasks. Utilizing this simulator, we present two case studies that were previously time-consuming but are now easily carried out. The first case study reveals the possibility of using policy-based reinforcement learning to explore novel scheduling strategies; the second explores the dominant processors within heterogeneous architectures, providing insights for optimizing heterogeneous architecture design.
09:05 CEST TS17.8 FAST INTERPRETER-BASED INSTRUCTION SET SIMULATION FOR VIRTUAL PROTOTYPES
Speaker:
Manfred Schlägl, Institute for Complex Systems, Johannes Kepler University Linz, AT
Authors:
Manfred Schlaegl and Daniel Grosse, Johannes Kepler University Linz, AT
Abstract
The Instruction Set Simulators (ISSs) used in Virtual Prototypes (VPs) are typically implemented as interpreters with the goal of being easy to understand and fast to adapt and extend. However, the performance of instruction interpretation is very limited, and the ever-increasing complexity of Hardware (HW) poses a growing challenge to this approach. In this paper, we present optimization techniques for interpreter-based ISSs that significantly boost performance while preserving comprehensibility and adaptability. We consider the RISC-V ISS of an existing, SystemC-based open-source VP with extensive capabilities such as running Linux and interactive graphical applications. The optimization techniques feature a Dynamic Basic Block Cache (DBBCache) to accelerate ISS instruction processing and a Load/Store Cache (LSCache) to speed up ISS load and store operations to and from memory. In our evaluation, we consider 12 Linux-based benchmark workloads and compare our optimizations to the original VP as well as to the very efficient official RISC-V reference simulator Spike maintained by RISC-V International. Overall, we achieve up to 406.97 Million Instructions per Second (MIPS) and a significant average performance increase, by a factor of 8.98 over the original VP and 1.65 over the Spike simulator. To showcase the retention of both comprehensibility and adaptability, we implement support for the RISC-V half-precision floating-point extension (Zfh) in both the original and the optimized VP. A comparison of these implementations reveals no significant differences, ensuring that the stated qualities remain unaffected. The optimized VP including Zfh is available as open-source on GitHub.
09:10 CEST TS17.9 C2C-GEM5: FULL SYSTEM SIMULATION OF CACHE-COHERENT CHIP-TO-CHIP INTERCONNECTS
Speaker:
Luis Bertran Alvarez, LIRMM, FR
Authors:
Luis Bertran Alvarez1, Ghassan Chehaibar2, Stephen Busch2, Pascal Benoit3 and David Novo3
1LIRMM / Eviden, FR; 2Eviden, FR; 3Université de Montpellier, FR
Abstract
High-Performance Computing (HPC) is shifting toward chiplet-based System-on-Chip (SoC) architectures, necessitating advanced simulation tools for design and optimization. In this work, we extend the gem5 simulator to support cache-coherent multi-chip systems by introducing a new chip-to-chip interconnect model within the Ruby framework. Our implementation is adaptable to various coherence protocols, such as Arm CHI. Calibrated with real hardware, our model is evaluated using PARSEC workloads, demonstrating its accuracy in simulating coherent chip-to-chip interactions and its effectiveness in capturing key performance metrics early in the design flow.
09:15 CEST TS17.10 A 101 TOPS/W AND 1.73 TOPS/MM² 6T SRAM-BASED DIGITAL COMPUTE-IN-MEMORY MACRO FEATURING A NOVEL 2T MULTIPLIER
Speaker:
Priyanshu Tyagi, IIT Roorkee, IN
Authors:
Priyanshu Tyagi and Sparsh Mittal, IIT Roorkee, IN
Abstract
In this paper, we propose a 6T SRAM-based all-digital Compute-in-memory (CIM) macro for multi-bit multiply-and-accumulate (MAC) operations. We propose a novel 2T bitwise multiplier, which is a direct improvement over the previously proposed 4T NOR-gate-based multiplier. The 2T multiplier also eliminates the need to invert the input bits, which is required when using NOR gates as multipliers. We propose an efficient digital MAC computation flow based on a barrel shifter, which significantly reduces the latency of the shift operation. This brings down the overall latency incurred while performing MAC operations to 13ns/25ns for 4b/8b operands (in 65nm CMOS @ 0.6V), compared to 10ns/18ns (in 22nm CMOS @ 0.72V) for the previous work. The proposed CIM macro is fully re-configurable in weight bits (4/8/12/16) and input bits (4/8). It can perform concurrent MAC and weight update operations. Moreover, its fully digital implementation circumvents the challenges associated with analog CIM macros. For MAC operation with 4b weight and input, the macro achieves 24 TOPS/W at 1.2 V and 81 TOPS/W at 0.7 V. When using low-threshold-voltage transistors in the 2T multiplier, the macro works reliably even at 0.6V while achieving 101 TOPS/W.
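Functionally, a bitwise-multiplier-plus-barrel-shifter MAC flow amounts to accumulating 1-bit partial products shifted into position, as in this illustrative behavioral sketch (not the macro's circuit; operand widths are assumptions).

```python
# Illustrative bit-serial MAC: each multi-bit product is built from 1-bit AND
# partial products, shifted into place and accumulated; in hardware the AND is
# the bitwise multiplier and the shift is handled by a barrel shifter.
def mac_bitwise(weights, inputs, wbits=4, ibits=4):
    acc = 0
    for w, x in zip(weights, inputs):
        for i in range(ibits):               # one input bit per step
            x_bit = (x >> i) & 1
            for j in range(wbits):           # 1-bit "multiplier": AND of two bits
                w_bit = (w >> j) & 1
                acc += (w_bit & x_bit) << (i + j)   # shift partial product into position
    return acc

ws, xs = [3, 5, 7], [2, 4, 6]
assert mac_bitwise(ws, xs) == sum(w * x for w, x in zip(ws, xs))
print(mac_bitwise(ws, xs))   # 3*2 + 5*4 + 7*6 = 68
```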

TS18 Machine learning solutions for embedded and cyber-physical systems

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST TS18.1 DE²R: UNIFYING DVFS AND EARLY-EXIT FOR EMBEDDED AI INFERENCE VIA REINFORCEMENT LEARNING
Speaker:
Yuting He, University of Nottingham Ningbo China, CN
Authors:
Yuting He1, Jingjin Li1, Chengtai Li1, Qingyu Yang1, Zheng Wang2, Heshan Du1, Jianfeng Ren1 and Heng Yu1
1University of Nottingham Ningbo China, CN; 2Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, CN
Abstract
Executing neural networks on resource-constrained embedded devices is challenging, and efforts have been made at both the application and system levels to reduce the execution cost. Among them, early-exit networks reduce computational cost through intermediate exits, while Dynamic Voltage and Frequency Scaling (DVFS) offers system-level energy reduction. Existing works strive to unify early-exit and DVFS for combined benefits in both timing and energy flexibility, yet two limitations remain: 1) varying time constraints, which change how important each exit point is for inference accuracy, are not accounted for, and 2) the large configuration space prevents optimal decisions when unifying DVFS and early-exit as a multi-objective optimization problem. To address these challenges, we propose De²r, a reinforcement learning-based framework that jointly optimizes early-exit points and DVFS settings for continuous inference. In particular, De²r includes a cross-training mechanism that fine-tunes the early-exit network to accommodate dynamic time constraints and system conditions. Experimental results demonstrate that De²r achieves up to 22.03% energy reduction and 3.23% accuracy gain compared to contemporary techniques.
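A minimal sketch of the joint decision such a framework makes, assuming a tabular Q-learning agent over a discrete (exit point, DVFS level) action space; the states, reward, and hyperparameters are illustrative and not De²r's actual design.

```python
# Minimal sketch: a tabular Q-learning agent whose action jointly picks an
# early-exit point and a DVFS level. States, rewards and the action space
# are illustrative assumptions.
import random
from itertools import product

EXITS = [1, 2, 3]                    # candidate exit points in the network
DVFS = ["low", "mid", "high"]        # frequency/voltage levels
ACTIONS = list(product(EXITS, DVFS)) # joint action space
Q = {}                               # (state, action) -> value

def choose_action(state, eps=0.1):
    if random.random() < eps:
        return random.choice(ACTIONS)              # explore
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))   # exploit

def update(state, action, reward, next_state, alpha=0.5, gamma=0.9):
    best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

# One illustrative step: tight-deadline state, toy reward trading accuracy vs. energy.
s = ("deadline_tight",)
a = choose_action(s)
r = 1.0 if a[0] <= 2 and a[1] != "high" else -1.0
update(s, a, r, s)
print(a, Q[(s, a)])
```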
08:35 CEST TS18.2 CONTINUOUS GNN-BASED ANOMALY DETECTION ON EDGE USING EFFICIENT ADAPTIVE KNOWLEDGE GRAPH LEARNING
Speaker:
Sanggeon Yun, University of California, Irvine, US
Authors:
Sanggeon Yun1, Ryozo Masukawa1, William Chung1, Minhyoung Na2, Nathaniel Bastian3 and Mohsen Imani1
1University of California, Irvine, US; 2Kookmin University, KR; 3United States Military Academy at West Point, US
Abstract
The increasing demand for robust security solutions across various industries has made Video Anomaly Detection (VAD) a critical task in applications such as intelligent surveillance, evidence investigation, and violence detection. Traditional approaches to VAD often rely on finetuning large pre-trained models, which can be computationally expensive and impractical for real-time or resource-constrained environments. To address this, MissionGNN introduced a more efficient method by training a graph neural network (GNN) using a fixed knowledge graph (KG) derived from large language models (LLMs) like GPT-4. While this approach demonstrated significant efficiency in computational power and memory, it faces limitations in dynamic environments where frequent updates to the KG are necessary due to evolving behavior trends and shifting data patterns. These updates typically require cloud-based computation, posing challenges for edge computing applications. In this paper, we propose a novel framework that facilitates continuous KG adaptation directly on edge devices, overcoming the limitations of cloud dependency. Our method dynamically modifies the KG through a three-phase process: pruning, alternating, and creating nodes, enabling real-time adaptation to changing data trends. This continuous learning approach enhances the robustness of anomaly detection models, making them more suitable for deployment in dynamic and resource-constrained environments.
08:40 CEST TS18.3 BMP-SD: MARRYING BINARY AND MIXED-PRECISION QUANTIZATION FOR EFFICIENT STABLE DIFFUSION INFERENCE
Speaker:
Cheng Gu, Shanghai Jiao Tong University, CN
Authors:
Cheng Gu1, Gang Li2, Xiaolong Lin1, Jiayao Ling1, Jian Cheng3 and Xiaoyao Liang1
1Shanghai Jiao Tong University, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN; 3Institute of Automation, CN
Abstract
Stable Diffusion (SD) is an emerging deep neural network (DNN) model that has demonstrated impressive capabilities in generative tasks such as text-to-image generation. However, the iterative denoising stage with UNet in the SD model is extremely expensive in both computations and memory accesses, making it challenging for fast and energy-efficient edge deployment. To alleviate the overhead of denoising, in this paper we propose BMP-SD, a post-training quantization framework for hardware-efficient SD inference. BMP-SD employs binary weight quantization to significantly reduce the computational complexity and memory footprint of iterative denoising, along with dynamic, step-aware mixed-precision activation quantization, based on the observation that not all denoising steps are equally important. Experiments on the text-to-image generation task show that BMP-SD achieves mixed precision (W1.73A4.87) with minimal accuracy loss on MS-COCO 2014. We also evaluate the BMP-SD quantized model on multiple bit-flexible DNN accelerators; the results reveal that our method can deliver up to 5.14× performance and 3.85× energy efficiency improvements compared to W8A8 quantization.
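The combination of binary weights with step-aware activation bit-widths can be mimicked in a few lines of numpy, as sketched below; the scaling scheme and the bit-width schedule are illustrative assumptions, not BMP-SD's calibration.

```python
# Illustrative numpy sketch: binary weight quantization with a per-tensor scale,
# plus a per-denoising-step activation bit-width, mirroring the idea of
# "binary weights + step-aware mixed-precision activations".
import numpy as np

def binarize_weights(w):
    scale = np.mean(np.abs(w))            # one scale factor for the tensor
    return np.sign(w) * scale, scale      # weights become {-scale, +scale}

def quantize_activations(x, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax if np.max(np.abs(x)) > 0 else 1.0
    return np.round(x / scale) * scale

def act_bits_for_step(step, total_steps):
    # Hypothetical schedule: early denoising steps tolerate fewer activation bits.
    return 4 if step < total_steps // 2 else 8

w = np.random.randn(64, 64).astype(np.float32)
x = np.random.randn(64).astype(np.float32)
wb, _ = binarize_weights(w)
for step in (0, 40):
    xq = quantize_activations(x, act_bits_for_step(step, 50))
    print(step, act_bits_for_step(step, 50), float(np.abs(wb @ xq - w @ x).mean()))
```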
08:45 CEST TS18.4 DISTRIBUTED INFERENCE WITH MINIMAL OFF-CHIP TRAFFIC FOR TRANSFORMERS ON LOW-POWER MCUS
Speaker:
Victor Jung, ETH Zurich, CH
Authors:
Severin Bochem1, Victor Jung1, Arpan Suravi Prasad1, Francesco Conti2 and Luca Benini3
1ETH Zurich, CH; 2Università di Bologna, IT; 3ETH Zurich, CH | Università di Bologna, IT
Abstract
Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technological revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power microcontroller units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic exploiting a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy consumption of 0.64 mJ, a latency of 0.54 ms per inference, an above-linear speedup of 26.07×, and an energy-delay-product (EDP) improvement of 27.22×, compared to a single-chip system. On MobileBERT, the distributed system's runtime is 38.84 ms, with an above-linear 4.69× speedup when using 4 MCUs compared to a single-chip system.
08:50 CEST TS18.5 HIFI-SAGE: HIGH FIDELITY GRAPHSAGE-BASED LATENCY ESTIMATORS FOR DNN OPTIMIZATION
Speaker:
Shambhavi Balamuthu Sampath, BMW Group, DE
Authors:
Shambhavi Balamuthu Sampath1, Leon Hecht2, Moritz Thoma1, Lukas Frickenstein1, Pierpaolo Mori3, Nael Fasfous1, Manoj Rohit Vemparala1, Alexander Frickenstein1, Claudio Passerone3, Daniel Mueller-Gritschneder4 and Walter Stechele2
1BMW Group, DE; 2TU Munich, DE; 3Politecnico di Torino, IT; 4TU Wien, AT
Abstract
As deep neural networks (DNNs) are increasingly deployed on resource-constrained edge devices, optimizing and compressing them for real-time performance becomes crucial. Traditional hardware-aware DNN search methods often rely on inaccurate proxy metrics, expensive latency lookup tables, or slow hardware-in-the-loop (HIL) evaluations. To address this, quasi-generalized latency estimators, typically meta-learning-based, were proposed to replace HIL evaluations and accelerate the search. These come with a one-time data collection and training cost and can adapt to new hardware with few measurements. However, they still have some drawbacks: (1) they increase complexity by trying to generalize across a range of diverse hardware types; (2) they depend on handcrafted hardware descriptors, which may fail to capture hardware characteristics; (3) they often perform poorly on new, unseen hardware that differs significantly from their initial training set. To overcome these challenges, this paper turns to the more straightforward platform-specific estimators that do not require hardware descriptors and can be easily trained on any hardware. We introduce HiFi-SAGE, a high-fidelity GraphSAGE-based platform-specific latency estimator. When trained from scratch on only 100 latency measurements, our novel dual-head estimator design surpasses the state-of-the-art (SoTA) on the 10% error bound metric by up to 17.4 p.p. while achieving an impressive fidelity score of 99% on the diverse LatBench dataset. We demonstrate that applying HiFi-SAGE to a genetic algorithm-based DNN compression search achieves a Pareto front comparable to real HIL feedback, with a mean absolute percentage error (MAPE) of 2.54%, 2.48%, and 4.16% for InceptionV3, DenseNet169, and ResNet50, respectively. Compared to existing platform-specific works, the lower number of latency measurements and higher fidelity scores position HiFi-SAGE as an attractive alternative to replace expensive HIL setups. Code is available at: https://github.com/shamvbs/HiFi-SAGE.
08:55 CEST TS18.6 SOLARML: OPTIMIZING SENSING AND INFERENCE FOR SOLAR-POWERED TINYML PLATFORMS
Speaker:
Hao Liu, TU Delft, NL
Authors:
Hao Liu, Qing Wang and Marco Zuniga, TU Delft, NL
Abstract
Machine learning models can now run on microcontrollers. Thanks to advances in neural architecture search, we can automatically identify tiny machine learning (tinyML) models that satisfy stringent memory and energy requirements. However, existing methods often overlook the energy used during event detection and data gathering. This is critical for devices powered by renewable energy sources like solar power, where energy efficiency is paramount. To address this, we introduce SolarML, a solution designed specifically for solar-powered tinyML platforms, which optimizes the end-to-end system's inference accuracy and energy consumption, from data gathering and processing to model inference. Considering two applications, gesture recognition and keyword spotting, SolarML makes the following contributions: 1) a hardware platform with an optimal event detection mechanism that reduces event detection costs by up to 10× compared to state-of-the-art alternatives; 2) a joint optimization framework, eNAS, that reduces the energy consumption of the sensor and inference model by up to 2× compared to methods that only optimize the inference model. Together, they enable SolarML to run end-to-end gesture and audio inference on a battery-free tinyML platform by harvesting solar energy for only 30 and 57 seconds, respectively, in an office environment (500 lux).
09:00 CEST TS18.7 SAFELOC: OVERCOMING DATA POISONING ATTACKS IN HETEROGENEOUS FEDERATED MACHINE LEARNING FOR INDOOR LOCALIZATION
Speaker:
Akhil Singampalli, Colorado State University, US
Authors:
Akhil Singampalli, Danish Gufran and Sudeep Pasricha, Colorado State University, US
Abstract
Machine learning (ML) based indoor localization solutions are critical for many emerging applications, yet their efficacy is often compromised by hardware/software variations across mobile devices (i.e., device heterogeneity) and the threat of ML data poisoning attacks. Conventional methods aimed at countering these challenges show limited resilience to the uncertainties created by these phenomena. In response, we introduce SAFELOC, a novel framework that not only minimizes localization errors under these challenging conditions but also ensures model compactness for efficient mobile device deployment. SAFELOC introduces a novel fused neural network architecture that performs data poisoning detection and localization, with a low model footprint using federated learning (FL). Additionally, a dynamic saliency map-based aggregation strategy is designed to adapt based on the severity of the detected data poisoning scenario. Experimental evaluations demonstrate that SAFELOC achieves improvements of up to 5.9× in mean localization error, 7.8× in worst-case localization error, and a 2.1× reduction in model inference latency compared to state-of-the-art indoor localization frameworks across diverse indoor environments and data poisoning attack scenarios.
09:05 CEST TS18.8 HYBRID TOKEN SELECTOR BASED ACCELERATOR FOR VITS
Speaker:
Anadi Goyal, Indian Institute of Technology Jodhpur, IN
Authors:
Akshansh Yadav, Anadi Goyal and Palash Das, Indian Institute of Technology, Jodhpur, IN
Abstract
Vision Transformers (ViTs) have shown great success in computer vision but suffer from high computational complexity, which grows quadratically with the number of tokens processed. Token selection/pruning has emerged as a promising solution; however, early methods introduce significant overhead and complexity. Applying a token selector in the early layers of a ViT can yield substantial computational savings (GFLOPs) compared to using it in later layers. However, this approach often leads to significant accuracy loss, particularly with the popular Attention-based Token Selection (ATS) technique. To address these issues, we propose a hybrid token selection (HTS) strategy that integrates our Keypoint-based Token Selection (KTS) with the existing ATS method. KTS dynamically selects important tokens based on image content in the early layers, while ATS refines token pruning in the later layers. This hybrid approach reduces computational costs while maintaining accuracy. Additionally, we design custom hardware modules to accelerate the execution of the proposed methods and the ViT backbone. The proposed HTS delivers a 35.85% reduction in execution time relative to the baseline without any token selection. Furthermore, our results demonstrate that HTS achieves up to a 0.39% increase in accuracy and offers up to 6.05% savings in GFLOPs compared to existing methods.
09:10 CEST TS18.9 DAOP: DATA-AWARE OFFLOADING AND PREDICTIVE PRE-CALCULATION FOR EFFICIENT MOE INFERENCE
Speaker:
Yujie Zhang, National University of Singapore, SG
Authors:
Yujie Zhang, Shivam Aggarwal and Tulika Mitra, National University of Singapore, SG
Abstract
Mixture-of-Experts (MoE) models, though highly effective for various machine learning tasks, face significant deployment challenges on memory-constrained devices. While GPUs offer fast inference, their limited memory compared to CPUs means not all experts can be stored on the GPU simultaneously, necessitating frequent, costly data transfers from CPU memory, often negating GPU speed advantages. To address this, we present DAOP, an on-device MoE inference engine to optimize parallel GPU-CPU execution. DAOP dynamically allocates experts between CPU and GPU based on per-sequence activation patterns, and selectively pre-calculates predicted experts on CPUs to minimize transfer latency. This approach enables efficient resource utilization across various expert cache ratios while maintaining model accuracy through a novel graceful degradation mechanism. Comprehensive evaluations across various datasets show that DAOP outperforms traditional expert caching and prefetching methods by up to 8.20× and offloading techniques by 1.35× while maintaining accuracy.
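One simple way to picture the expert placement problem addressed here is a greedy allocation of the most frequently activated experts to the GPU under a memory budget, as in this sketch; the counts, sizes, and policy are assumptions, not DAOP's actual allocator.

```python
# Illustrative sketch: place the most frequently activated experts on the GPU
# under a memory budget and keep the rest on the CPU, so frequently used
# experts avoid CPU->GPU transfers. Counts and sizes are made-up assumptions.
def place_experts(activation_counts, expert_size_mb, gpu_budget_mb):
    order = sorted(activation_counts, key=activation_counts.get, reverse=True)
    placement, used = {}, 0.0
    for expert in order:
        if used + expert_size_mb <= gpu_budget_mb:
            placement[expert] = "gpu"
            used += expert_size_mb
        else:
            placement[expert] = "cpu"       # run (or pre-calculate) on the CPU instead
    return placement

counts = {"e0": 120, "e1": 15, "e2": 300, "e3": 48}   # per-sequence activations
print(place_experts(counts, expert_size_mb=400, gpu_budget_mb=1000))
# -> {'e2': 'gpu', 'e0': 'gpu', 'e3': 'cpu', 'e1': 'cpu'}
```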
09:15 CEST TS18.10 SPIKESTREAM: ACCELERATING SPIKING NEURAL NETWORK INFERENCE ON RISC-V CLUSTERS WITH SPARSE COMPUTATION EXTENSIONS
Speaker:
Simone Manoni, Università di Bologna, IT
Authors:
Simone Manoni1, Paul Scheffler2, Luca Zanatta3, Andrea Acquaviva1, Luca Benini4 and Andrea Bartolini1
1Università di Bologna, IT; 2ETH Zurich, CH; 3NTNU, NO; 4ETH Zurich, CH | Università di Bologna, IT
Abstract
Spiking Neural Network (SNN) inference has a clear potential for high energy efficiency as computation is triggered by events. However, the inherent sparsity of events poses challenges for conventional computing systems, driving the development of specialized neuromorphic processors, which come with high silicon area costs and lack the flexibility needed for running other computational kernels, limiting widespread adoption. In this paper, we explore the low-level software design, parallelization, and acceleration of SNNs on general-purpose multicore clusters with a low-overhead RISC-V ISA extension for streaming sparse computations. We propose SpikeStream, an optimization technique that maps weight accesses to affine and indirect register-mapped memory streams to enhance performance, utilization, and efficiency. Our results on the end-to-end Spiking-VGG11 model demonstrate a significant 4.39× speedup and an increase in utilization from 9.28% to 52.3% compared to a non-streaming parallel baseline. Additionally, we achieve an energy efficiency gain of 3.46× over LSMCore and a performance gain of 2.38× over Loihi.
09:20 CEST TS18.11 REACT: RANDOMIZED ENCRYPTION WITH AI-CONTROLLED TARGETING FOR NEXT-GEN SECURE COMMUNICATION
Speaker:
Hossein Sayadi, California State University, Long Beach, US
Authors:
Zhangying He and Hossein Sayadi, California State University, Long Beach, US
Abstract
This work introduces REACT (Randomized Encryption with AI-Controlled Targeting), a novel framework leveraging Deep Reinforcement Learning (DRL) and Moving Target Defense (MTD) to secure chaotic communication in resource-constrained environments. REACT employs a random generator to dynamically assign encryption modes, creating unpredictable patterns that thwart interception. At the receiver's end, four DRL agents collaborate to identify encryption modes and apply decryption methods, ensuring secure, synchronized communication. Evaluation results demonstrate up to 100% decryption accuracy and a 51% reduction in attack success probability, establishing REACT as a robust and adaptive defense for secure and reliable communication.
09:21 CEST TS18.12 DUSGAI: A DUAL-SIDE SPARSE GEMM ACCELERATOR WITH FLEXIBLE INTERCONNECTS
Speaker:
Wujie Zhong, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Wujie Zhong and Yangdi Lyu, The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Sparse general matrix multiplication (SpGEMM) is a crucial operation of deep neural networks (DNNs), leading to the development of numerous specialized SpGEMM accelerators. These accelerators leverage flexible interconnects, thereby outperforming their rigid counterparts. However, the suboptimal utilization of sparsity patterns limits overall performance efficiency. In this work, we propose DuSGAI, a sparse GEMM accelerator that employs a parallel index intersection structure to utilize dual-side sparsity. Our evaluation of DuSGAI with five popular DNN models demonstrates a 3.03× performance improvement compared to the state-of-the-art SpGEMM accelerator.

TS19 Design and test of secure systems

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 10:00 CEST

Time Label Presentation Title
Authors
08:30 CEST TS19.1 DE2: SAT-BASED SEQUENTIAL LOGIC DECRYPTION WITH A FUNCTIONAL DESCRIPTION
Speaker:
Hai Zhou, Northwestern University, US
Authors:
You Li, Guannan Zhao, Yunqi He and Hai Zhou, Northwestern University, US
Abstract
Logic locking is a promising approach to protect the intellectual properties of integrated circuits. Existing logic locking schemes assume that an adversary must possess a cycle-accurate oracle circuit to launch an I/O attack. This paper presents DE2, a novel and rigorous attacking algorithm based on a new adversarial model. DE2 only takes a high-level functional specification of the victim chip. Such specifications are increasingly prevalent in the modern IC design flow. DE2 closes the timing gap between the specification and the circuit with an automatic alignment mechanism, which enables effective logic decryption without cycle-accurate information. An essential enabler of DE2 is a synthesis-based sequential logic decryption algorithm called LIM, which introduces only a minimal overhead in every iteration. Experiments show that DE2 can efficiently attack logic-locked benchmarks without access to a cycle-accurate oracle circuit. Besides, LIM can solve 20% more ISCAS'89 benchmarks than state-of-the-art sequential logic decryption algorithms.
08:35 CEST TS19.2 HARDWARE/SOFTWARE RUNTIME FOR GPSA PROTECTION IN RISC-V EMBEDDED CORES
Speaker:
Louis Savary, INRIA, FR
Authors:
Louis Savary1, Simon Rokicki2 and Steven Derrien3
1INRIA, FR; 2IRISA, FR; 3Université de Bretagne Occidentale | Lab-STICC, FR
Abstract
State-of-the-art hardware countermeasures against fault attacks are based, among others, on control flow and code integrity checking. Generalized Path Signature Analysis and Continuous Signature Monitoring can assert these integrity properties. However, existing support for such mechanisms requires a dedicated compiler flow and cannot handle indirect jumps. This work proposes a technique based on a hardware/software runtime that generates those signatures while executing unmodified off-the-shelf RISC-V binaries. The proposed approach has been implemented on a pipelined processor, and experimental results show an average slowdown of 3× compared to unprotected implementations while being completely compiler-independent.
08:40 CEST TS19.3 ANALOG CIRCUIT ANTI-PIRACY SECURITY BY EXPLOITING DEVICE RATINGS
Speaker:
Hazem Hammam, Sorbonne Université, CNRS, LIP6, FR
Authors:
Hazem Hammam1, Hassan Aboushady1 and Haralampos-G. Stratigopoulos2
1Sorbonne Université, CNRS, LIP6, FR; 2Sorbonne University, CNRS, LIP6, FR
Abstract
We propose a novel anti-piracy security technique for analog and mixed-signal (AMS) circuits. The circuit is re-designed by obfuscating transistors and capacitors with key-controlled versions. We obfuscate both the device geometries and their ratings, which define the maximum allowable current, voltage, and power dissipation. The circuit is designed to function correctly only with a specific key. Loading any other incorrect key degrades performance and for the vast majority of these keys the chip is damaged because of electrical over-stress. This prevents counter-attacks that employ a chip to search for the correct key. The methodology is demonstrated on a low-dropout regulator (LDO) designed in the 22nm FDSOI technology by GlobalFoundries. By locking the LDO, the entire chip functionality breaks unless the LDO is unlocked first. The secured LDO shows no performance penalty and area overhead is justifiable and less than 25%, while it is protected against all known counter-attacks in the AMS domain.
08:45 CEST TS19.4 SIDE-CHANNEL COLLISION ATTACKS AGAINST ASCON
Speaker:
Hao Zhang, Nanjing University of Science and Technology, CN
Authors:
Hao Zhang, Yiwen Gao, Yongbin Zhou and Jingdian Ming, Nanjing University of Science and Technology, CN
Abstract
Side-channel attacks pose a significant threat to the security of electronic devices, particularly IoT/AIoT terminals. By leveraging side-channel leakages, collision attacks can efficiently extract secret keys from cryptographic devices while requiring considerably less computational effort. In this paper, we investigate side-channel collision attacks against ASCON, a lightweight cipher designed for resource-constrained devices, which has been standardized by NIST. For the first time, we propose a side-channel key recovery attack against ASCON by identifying collisions in the linear diffusion layer. Using the Pearson correlation coefficient and Euclidean distance for internal collision detection, our attack successfully recovers the secret key with approximately 5,000 power traces from an 8-bit software implementation on an AVR device. To further reduce attack complexity, we introduce a novel metric, Locally-Weighted Sum (LWS), which focuses on the most likely points of leakage, thereby decreasing the number of power traces required for a successful attack. Our experiment on the same target demonstrates that the LWS-based collision attack can recover the full secret key with approximately 3,000 power traces, a reduction of 40 percent. Our study indicates that ASCON is susceptible to side-channel collision attacks, and that bitslice implementations remain vulnerable to such threats.
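Conceptually, an internal collision test of the kind mentioned above compares the leakage of two groups of traces, for example via the Pearson correlation of their mean traces, as in this synthetic sketch (not the paper's attack; thresholds and traces are assumptions).

```python
# Illustrative sketch: decide whether two groups of power traces correspond to
# a colliding intermediate value by comparing their mean traces with the
# Pearson correlation and the Euclidean distance. Data here is synthetic.
import numpy as np

def detect_collision(traces_a, traces_b, corr_thresh=0.9):
    mean_a = traces_a.mean(axis=0)
    mean_b = traces_b.mean(axis=0)
    corr = np.corrcoef(mean_a, mean_b)[0, 1]     # Pearson correlation
    dist = np.linalg.norm(mean_a - mean_b)       # Euclidean distance
    return corr > corr_thresh, corr, dist

rng = np.random.default_rng(0)
leak = rng.normal(size=200)                       # shared leakage shape
a = leak + 0.1 * rng.normal(size=(50, 200))       # colliding group
b = leak + 0.1 * rng.normal(size=(50, 200))       # colliding group
c = rng.normal(size=(50, 200))                    # non-colliding group
print(detect_collision(a, b))   # high correlation -> collision
print(detect_collision(a, c))   # low correlation  -> no collision
```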
08:50 CEST TS19.5 CUTE-LOCK: BEHAVIORAL AND STRUCTURAL MULTI-KEY LOGIC LOCKING USING TIME BASE KEYS
Speaker:
Amin Rezaei, California State University, Long Beach, US
Authors:
Kevin Lopez and Amin Rezaei, California State University, Long Beach, US
Abstract
The outsourcing of semiconductor manufacturing raises security risks, such as piracy and overproduction of hardware intellectual property. To overcome this challenge, logic locking has emerged to lock a given circuit using additional key bits. While single-key logic locking approaches have demonstrated serious vulnerability to a wide range of attacks, multi-key solutions, if carefully designed, can provide a reliable defense against not only oracle-guided logic attacks, but also removal and dataflow attacks. In this paper, using time-based keys, we propose, implement and evaluate a family of secure multi-key logic locking algorithms called Cute-Lock that can be applied to both RTL-level behavioral and netlist-level structural representations of sequential circuits. Our extensive experimental results under a diverse range of attacks confirm that, compared to vulnerable state-of-the-art methods, employing the Cute-Lock family drives attacking attempts to a dead end without additional overhead.
08:55 CEST TS19.6 SAFELIGHT: ENHANCING SECURITY IN OPTICAL CONVOLUTIONAL NEURAL NETWORK ACCELERATORS
Speaker:
Salma Afifi, Colorado State University, US
Authors:
Salma Afifi1, Ishan Thakkar2 and Sudeep Pasricha1
1Colorado State University, US; 2University of Kentucky, US
Abstract
The rapid proliferation of deep learning has revolutionized computing hardware, driving innovations to improve computationally expensive multiply-accumulate operations in deep neural networks. Among these innovations are integrated silicon-photonic systems that have emerged as energy-efficient platforms capable of achieving light speed computation and communication, positioning optical neural network (ONN) platforms as a transformative technology for accelerating deep learning models such as convolutional neural networks (CNNs). However, the increasing complexity of optical hardware introduces new vulnerabilities, notably the risk of hardware trojan (HT) attacks. Despite the growing interest in ONN platforms, little attention has been given to how HT-induced threats can compromise performance and security. This paper presents an in-depth analysis of the impact of such attacks on the performance of CNN models accelerated by ONN accelerators. Specifically, we show how HTs can compromise microring resonators (MRs) in a state-of-the-art non-coherent ONN accelerator and reduce classification accuracy across CNN models by 7.49% to 80.46% while targeting just 10% of MRs. We then propose techniques to enhance ONN accelerator robustness against these attacks and show how the best of them can effectively recover the lost accuracy.
09:00 CEST TS19.7 ONE MORE MOTIVATION TO USE EVALUATION TOOLS, THIS TIME FOR HARDWARE MULTIPLICATIVE MASKING OF AES
Speaker:
Hemin Rahimi, TU Darmstadt, DE
Authors:
Hemin Rahimi and Amir Moradi, TU Darmstadt, DE
Abstract
Safeguarding cryptographic implementations against the increasing threat of Side-Channel Analysis (SCA) attacks is essential. Masking, a countermeasure that randomizes intermediate values, is a cornerstone of such defenses. In particular, an SCA-secure implementation of AES, the most widely used encryption standard, can employ Boolean masking as well as multiplicative masking due to its underlying Galois field operations. However, multiplicative masking is susceptible to vulnerabilities, including the zero-value problem, which was identified soon after multiplicative masking was introduced. At CHES 2018, De Meyer et al. proposed a hardware-based approach to manage these challenges and implemented multiplicative masking for AES, incorporating a Kronecker delta function and randomness optimization. In this work, we evaluate their design using the PROLEAD evaluation tool under the glitch- and transition-extended probing model. Our findings reveal a critical vulnerability in their first-order implementation of the Kronecker delta function, stemming from the employed randomness optimization. This leakage compromises the security of their masked AES Sbox. After pinpointing the source of this leakage, we propose an alternative randomness optimization to address the issue and demonstrate its effectiveness through rigorous evaluations with PROLEAD.
09:05 CEST TS19.8 THREE EYED RAVEN: AN ON-CHIP SIDE CHANNEL ANALYSIS FRAMEWORK FOR RUN-TIME EVALUATION
Speaker:
M Dhilipkumar, IIT Kanpur, IN
Authors:
M Dhilipkumar, Priyanka Bagade and Debapriya Basu Roy, IIT Kanpur, IN
Abstract
Side-channel attacks exploit physical leakages from hardware components, such as power consumption, to break secure cryptographic algorithms and retrieve their secret keys. Therefore, evaluating implementations of cryptographic algorithms against such analysis is of paramount importance. A typical side-channel evaluation framework requires external devices such as a sampling oscilloscope along with a customized analysis board, which makes the evaluation both expensive and time-consuming. However, recent advancements in developing on-chip sensors on FPGAs for monitoring side-channel information pave the way towards a fully on-chip side-channel analysis framework without the requirement of any external devices, reducing both the cost and the time required to carry out these experiments. In this paper, we propose our on-chip side-channel analysis framework RAVEN, which is augmented with hardware implementations of Test Vector Leakage Assessment (TVLA), Correlation Power Analysis (CPA), and Deep Learning based Leakage Assessment (DL-LA). The on-chip hardware implementations of these side-channel evaluation algorithms, coupled with on-chip sensors, allow RAVEN to assess the side-channel security of the crypto implementation in a fast and efficient manner. Our proposed implementation of DL-LA can also be trained on-chip and does not require pre-trained weight values. RAVEN's resource consumption is modest, as the entire design along with the sensors fits into an AMD-Xilinx PYNQ board. We have validated the proposed RAVEN framework on AES-128 traces, and the results of the hardware implementations of TVLA, CPA, and DL-LA closely resemble those of the software implementations while requiring significantly less time and storage.
09:10 CEST TS19.9 RTL-BREAKER: ASSESSING THE SECURITY OF LLMS AGAINST BACKDOOR ATTACKS ON HDL CODE GENERATION
Speaker:
Lakshmi Likhitha Mankali, New York University, US
Authors:
Lakshmi Likhitha Mankali1, Jitendra Bhandari1, Manaar Alam2, Ramesh Karri1, Michail Maniatakos2, Ozgur Sinanoglu2 and Johann Knechtel2
1New York University, US; 2New York University Abu Dhabi, AE
Abstract
Large language models (LLMs) have demonstrated remarkable potential for code generation/completion tasks in hardware design. However, the reliance on such automation introduces critical security risks. Notably, given that LLMs have to be trained on vast datasets of code that are typically sourced from publicly available repositories, often without thorough validation, LLMs are susceptible to so-called data poisoning or backdoor attacks. Here, attackers inject malicious code into the training data, which can be carried over into the hardware description language (HDL) code generated by LLMs. This threat vector can compromise the security and integrity of entire hardware systems. In this work, we propose RTL-Breaker, a novel backdoor attack framework for LLM-based HDL code generation. RTL-Breaker provides an in-depth analysis of essential aspects of this novel problem: 1) various trigger mechanisms and their effectiveness for inserting malicious modifications, and 2) the side effects of backdoor attacks on code generation in general, i.e., the impact on code quality. RTL-Breaker emphasizes the urgent need for more robust measures to safeguard against such attacks. Toward that end, we open-source our framework and all data.
09:15 CEST TS19.10 MC3: MEMORY CONTENTION-BASED COVERT CHANNEL COMMUNICATION ON SHARED DRAM SYSTEM-ON-CHIPS
Speaker:
Ismet Dagli, Colorado School of Mines, US
Authors:
Ismet Dagli1, James Crea1, Soner Seckiner2, Yuanchao Xu3, Selcuk Kose2 and Mehmet Belviranli1
1Colorado School of Mines, US; 2University of Rochester, US; 3University of California, Santa Cruz, US
Abstract
Shared memory system-on-chips (SM-SoCs) are ubiquitously employed by a wide range of computing platforms, including edge/IoT devices, autonomous systems, and smartphones. In SM-SoCs, system-wide shared memory enables a convenient and financially feasible way to make data accessible across dozens of processing units (PUs), such as CPU cores and domain-specific accelerators. Due to the diverse computational characteristics of the PUs they embed, SM-SoCs often do not employ a shared last-level cache (LLC). While the literature studies covert channel attacks for shared memory systems, high-throughput communication is currently possible only through either relying on an LLC or having privileged/physical access to the shared memory subsystem. In this study, we introduce a new memory-contention-based covert communication attack, MC3, which specifically targets shared system memory in mobile SoCs. Unlike existing attacks, our approach achieves high-throughput communication without the need for an LLC or elevated access to the system. We explore the effectiveness of our methodology by demonstrating the trade-off between the channel transmission rate and the robustness of the communication. We evaluate MC3 on NVIDIA Orin AGX, NX, and Nano platforms and achieve transmission rates of up to 6.4 Kbps with less than 1% error rate.
09:20 CEST TS19.11 COMB FREQUENCY DIVISION MULTIPLEXING: A NON-BINARY MODULATION FOR AIRGAP COVERT CHANNEL TRANSMISSION
Speaker:
Mohamed-alla-eddine BAHI, Univ Rennes, INSA Rennes, CNRS, IETR - UMR 6164, F-35000 Rennes, France, FR
Authors:
Mohamed-alla-eddine BAHI1, Maria MENDEZ REAL2 and Maxime PELCAT2
1Univ Rennes, INSA Rennes, IETR, UMR CNRS 6164, FR; 2IETR - UMR CNRS 6164, FR
Abstract
Isolated networks ensure the confidentiality of sensitive data on a system by eliminating all physical connections to public networks or external devices, making the system air-gapped. However, previous work has shown that Electromagnetic (EM) emanations, when correlated with secret data, can lead to side or covert channels. Specifically, EM emissions caused by clocks can modulate high-frequency signals, enabling unauthorized data transmission to cross the air gap. This work focuses on covert channels where a software or hardware Trojan inserted in the victim system induces side-channel emissions that the attacker can recover through the covert channel, producing an intentional transmission and leakage of sensitive information. This paper introduces a novel encoding method for covert channels called Comb Frequency Division Multiplexing (CFDM). CFDM leverages modulated signals emitted by the victim system, which are evenly spaced across the frequency spectrum, creating a comb-like pattern. Moreover, the uncontrolled nature of the side-channel modulation can make each subcarrier carry different information. Unlike traditional methods such as Frequency Shift Keying (FSK) and Amplitude Shift Keying (ASK), CFDM encodes information in both the frequency and amplitude dimensions of the covert channel harmonic sub-carriers.
09:21 CEST TS19.12 MULTI-SENSOR DATA FUSION FOR ENHANCED DETECTION OF LASER FAULT INJECTION ATTACKS IN CRYPTOGRAPHIC HARDWARE: PRACTICAL RESULTS
Speaker:
Naghmeh Karimi, University of Maryland Baltimore County, US
Authors:
Mohammad Ebrahimabadi1, Raphael Viera2, Sylvain Guilley3, Jean Luc Danger4, Jean-Max Dutertre5 and Naghmeh Karimi1
1University of Maryland Baltimore County, US; 2Ecole de Mines de Saint-Etienne, FR; 3Secure-IC, FR; 4Télécom ParisTech, FR; 5Mines Saint-Etienne, FR
Abstract
Though considered secure, cryptographic hardware can be compromised by adversaries injecting faults during runtime to leak secret keys from faulty outputs. Among fault injection methods, laser illumination has gained the most attention due to its precision in targeting specific areas and its fine temporal control. Accordingly, to tackle such attacks, this paper proposes a low-cost detection scheme that leverages Time-To-Digital Converters (TDC) to sense the IR drops caused by laser illumination. To mitigate the false alarm rate while maintaining a high detection rate, our method embeds multiple sensors (as few as two, as discussed in the text). To evaluate the impact of laser illumination and the effectiveness of our proposed scheme, we conducted extensive experiments (≈200k) using a real laser setup to illuminate the targeted AES module implemented on an Artix-7 FPGA. The results confirm the high accuracy of our detection method; achieving 82% fault detection with less than 0.01% false alarms and a detection latency of just 4 clock cycles. Notably, it enabled preventive actions in 70% of cases where illumination occurred but the AES outcome had not changed, greatly enhancing circuit security against key leakage.

W07 Designing Sustainable Intelligent Systems: Integrating Carbon Footprint Reduction, TinyML, and RISC-V

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 08:30 CEST - 12:30 CEST


FS09 Focus Session

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST


MPP02 Multi-Partner Projects

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST MPP02.1 MULTI-PARTNER PROJECT: SECURING FUTURE EDGE-AI PROCESSORS IN PRACTICE (CONVOLVE)
Speaker:
Sven Argo, Ruhr-University Bochum, DE
Authors:
Sven Argo1, Henk Corporaal2, Alejandro Garza3, Manil Dev Gomony4, Tim Güneysu1, Adrian Marotzke3, Fouwad Mir5, Jan Richter-Brockmann1, Jeffrey Smith2 and Mottaqiallah Taouil6
1Ruhr University Bochum, DE; 2Eindhoven University of Technology, NL; 3NXP Semiconductors, DE; 4Eindhoven University of Technology, NL; 5Delft University of Technology (TU Delft), NL; 6TU Delft, NL
Abstract
Artificial Intelligence (AI) has had a profound impact on our contemporary society, and it is indisputable that it will continue to play a significant role in the future. To further enhance AI experience and performance, a transition from large-scale server applications towards AI-powered edge devices is inevitable. In fact, current projections indicate that the market for Smart Edge Processors (SEPs) will grow beyond 70 Billion USD by 2026 [1]. Such a shift comes with major challenges, as these devices have limited computing and energy resources yet need to be highly performant. Additionally, security mechanisms need to be implemented to protect against diverse attack vectors as attackers now have physical access to the device. Besides cryptographic keys, Intellectual Property (IP), including neural network weights, may also be potential targets. The CONVOLVE [2] project (currently in its intermediate stage) follows a holistic approach to address these challenges and establish the EU in a leading position in embedded, ultra-low-power and secure processors for edge computing. It encompasses novel hardware technologies, end-to-end integrated workflows, and a security-by-design approach. This paper highlights the security aspects of future edge-AI processors by illustrating challenges encountered in CONVOLVE, the solutions we pursue including some early results, and directions for future research.
11:05 CEST MPP02.2 MULTI-PARTNER PROJECT: OPEN-SOURCE DESIGN TOOLS FOR CO-DEVELOPMENT OF AI ALGORITHMS AND AI CHIPS
Speaker:
Mehdi Tahoori, Karlsruhe Institute of Technology, DE
Authors:
Mehdi Tahoori1, Joerg Henkel1, Jürgen Teich2, Juergen Becker1, Ulf Schlichtmann3, Norbert Wehn4, Georg Sigl5 and Wolfgang Kunz4
1Karlsruhe Institute of Technology, DE; 2Friedrich-Alexander-Universität Erlangen-Nürnberg, DE; 3TU Munich, DE; 4University of Kaiserslautern-Landau, DE; 5TU Munich/Fraunhofer AISEC, DE
Abstract
Chip technologies are crucial for the digital transformation of industry and society. Artificial Intelligence (AI) is playing an increasingly important role in both our daily lives and in industry. The development of advanced AI chip designs, essential for the successful deployment of AI, is of critical importance for innovation and competitiveness. However, challenges arise from the complexity of hardware development, expensive access to state-of-the-art design tools, and a global shortage of hardware experts. In addition to cost optimization, computational power, and energy consumption, security and trustworthiness are becoming increasingly important. This project aims to address these challenges in AI chip design by enabling efficient hardware development. We are developing a seamless transition between software-based AI model development and optimization, and efficient hardware implementation, while considering security, trustworthiness, and energy efficiency. An open-source approach plays a key role, facilitating access for small and medium-sized enterprises (SMEs) and expanding the community involved in AI chip design to help mitigate the shortage of skilled professionals.
11:10 CEST MPP02.3 MULTI-PARTNER PROJECT: SUSTAINABLE TEXTILE ELECTRONICS (STELEC)
Speaker:
Bo Zhou, German Research Centre for Artificial Intelligence (DFKI), DE
Authors:
Bo Zhou1, Mengxi Liu1, Sizhen Bian1, Daniel Geißler1, Paul Lukowicz1, Jose Miranda2, Jonathan Dan3, David Atienza3, Mohamed Riahi4, Norbert Wehn4, Russel Torah5, Sheng Yong5, Jidong Liu5, Stephen Beeby5, Magdalena Kohler6, Berit Greinke6, Junchun Yu7, Vincent Nierstrasz7, Leila Sheldrick8, Rebecca Stewart8, Tommaso Nieri9, Matteo Maccanti9 and Daniele Spinelli9
1DFKI, DE; 2EPFL, CH; 3EPFL, CH; 4RPTU, DE; 5University of Southampton, GB; 6UDK, DE; 7University of Borås, SE; 8Imperial College London, GB; 9Next Technology Tecnotessile, IT
Abstract
E-textiles are rapidly emerging as an important area of electronic circuit applications. They also enable many socially important applications such as personalized health, elderly care, and smart agriculture. However, the environmental impact and sustainability of e-textiles remain very problematic. STELEC, short for Sustainable Textile ELECtronics, is an interdisciplinary research project funded by the European Innovation Council (EIC) under the Pathfinder programme on the responsible electronics topic, seeking cutting-edge innovation. STELEC started in September 2024 and is in its initial stage. The project is a multinational collaboration of research institutes, universities and companies across Europe. It aims at developing next-generation textile-based electronics in applications from sensing and processing to AI, with a commitment to full lifecycle sustainability.
11:15 CEST MPP02.4 MULTI-PARTNER PROJECT: TWINNING FOR EXCELLENCE IN RELIABLE ELECTRONICS (TWIN-RELECT)
Speaker:
Marko Andjelkovic, Leibniz-Institut für innovative Mikroelektronik, DE
Authors:
Marko Andjelkovic1, Fabian Vargas1, Milos Krstic1, Luigi Dilillo2, Alain Michez3, Frederic Wrobel3, Davide Bertozzi4, Mikel Lujan4, Christos Georgakidis5, Keterina Tsilingiri5, Nikolaos Chatzivangelis5, Nikolaos Zazatis5, Giorgos-Ioanis Pagliaroutis5, Pelopidas Tsoumanis5 and Christos Sotiriou5
1Leibniz-Institut für innovative Mikroelektronik, DE; 2CNRS, FR; 3Université de Montpellier, FR; 4The University of Manchester, GB; 5University of Thessaly, GR
Abstract
Reliable electronics plays a major role in shaping our daily lives, being a key enabler for critical applications such as space missions, avionics, automotive, medicine, banking, automated industry, and wireless communication networks. However, the design of highly reliable electronic systems remains a challenge with the advances in semiconductor technology and the increase in integrated circuit (IC) complexity. In this work, we introduce the Horizon Europe Twinning project TWIN-RELECT, aimed at strengthening the scientific expertise in designing reliable integrated circuits. The paper presents the general project concept and objectives, and the main directions of the joint research activities. The primary scientific goal is to contribute to the development of a novel, more efficient European Electronic Design Automation (EDA) tool-chain for the design of reliable chips.
11:20 CEST MPP02.5 MULTI-PARTNER PROJECT: LOLIPOP-IOT – DESIGN AND SIMULATION OF ENERGY-EFFICIENT DEVICES FOR THE INTERNET OF THINGS
Speaker:
Jakub Lojda, Brno University of Technology, CZ
Authors:
Jakub Lojda, Josef Strnadel, Pavel Smrz and Vaclav Simek, Brno University of Technology, CZ
Abstract
This paper presents an overview of the Internet of Things (IoT) device design and simulation, with a specific focus on low-power design principles – everything in the context of the LoLiPoP-IoT project. The project aims to enhance IoT device usability by reducing maintenance requirements related to battery recharging or replacement. Another key goal is to significantly decrease the massive waste generated by discarded primary batteries, contributing to more sustainable and user-friendly IoT solutions for the future. The primary focus of this paper is on a custom IoT localization tag, for which we simulate solar cells – ranging from basic modeling to their integration into electrical circuits – and the power consumption of the tag's electronics platform. The analyzed sample platform is built on the nRF52833 microcontroller and the DW3110 ultra-wideband transceiver. We also applied our experimental framework principles to optimize power consumption and extend battery life. Reductions in photovoltaic panel area were achieved for both devices with a 5-year lifespan and fully autonomous tags, though with increased localization latency. Furthermore, this paper demonstrates how IoT devices, including their firmware, can be effectively modeled and simulated using publicly available tools.
11:25 CEST MPP02.6 MULTI-PARTNER PROJECT: CONTRIBUTING TO TRUSTED CHIP DESIGN USING REVERSE ENGINEERING METHODS (RESEC)
Speaker:
Johanna Baehr, Fraunhofer AISEC, DE
Authors:
Bernhard Lippmann1, Johanna Baehr2, Horst Gieser3 and Alexander Hepp2
1Infineon Technologies, DE; 2TU Munich, DE; 3Fraunhofer EMFT, DE
Abstract
The RESEC (REconstruction of highly integrated SECurity devices) project addresses the growing concerns of malicious modification and IP piracy in globally distributed supply chains. The project's primary objective is to develop, verify, and optimize a complete reverse engineering process for integrated circuits manufactured in technology nodes of 40 nm and below. This paper highlights the significant contributions of RESEC in the areas of sample preparation, computer vision, and netlist analysis, thereby extending the state of the art in reverse engineering. The project's outcomes are expected to have a profound impact on the development and physical verification of trusted chips, paving the way for future research.
11:30 CEST MPP02.7 MULTI-PARTNER PROJECT: SMART SENSOR ANALOG FRONT-ENDS POWERED BY EMERGING RECONFIGURABLE DEVICES (SENSOTERIC)
Speaker:
Jens Trommer, NaMLab gGmbH, DE
Authors:
Giulio Galderisi1, Andreas Kramer2, Andreas Fuchsberger3, Jose Maria Gonzalez-Medina4, Yuxuan He1, Lee-Chi Hung4, Merrit Jen Hong Li5, Julian Kulenkampff2, Maximilian Reuter2, Lukas Wind3, Masiar Sistani3, Thomas Mikolajick1, Bruno Neckel-Wesling6, Marina Deng6, Cristell Maneux6, Pieter Harpe5, Sonia Prado Lopez3, Oskar Baumgartner4, Chhandak Mukherjee6, Eugenio Cantatore5, Sandro Carrara7, Klaus Hofmann2, Walter Weber3 and Jens Trommer1
1NaMLab gGmbH, DE; 2TU Darmstadt, DE; 3TU Vienna, AT; 4Global TCAD Solutions GmbH, AT; 5Eindhoven University of Technology, NL; 6University of Bordeaux, FR; 7EPFL, CH
Abstract
This work introduces SENSOTERIC, a HORIZON EU multi-partner project that aims at leveraging the properties of emerging Reconfigurable Field Effect Transistors (RFETs) to develop a sensor platform. RFETs will be used for a generic sensor interface and for a dedicated transducer element. In the first case, our goal is to develop an analog front-end interface that can be tuned at runtime to adapt to different environmental conditions and be used in a broad spectrum of applications. This feature shall be enabled by the polarity-control and negative differential resistance characteristics of the reconfigurable devices employed, which are co-integrable with industrial CMOS processes such as 22 nm FDSOI. In the second case, we want to exploit the intrinsic nature of these doping-free devices to yield better 1/f noise performance when compared to classic CMOS transducers. Moreover, the presence of un-gated areas on top of the channel of these devices makes them ideal candidates for functionalization. In this early-stage overview of the project, we introduce the key features and the vision that make SENSOTERIC a unique contribution towards smart sensing solutions in environmental monitoring and healthcare.
11:31 CEST MPP02.8 MULTI-PARTNER PROJECT: SECURE HARDWARE ACCELERATED DATA ANALYTICS FOR 6G NETWORKS: THE PRIVATEER APPROACH
Speaker:
Ilias Papalamprou, National TU Athens, GR
Authors:
Ilias Papalamprou1, Aimilios Leftheriotis2, Apostolos Garos3, Georgios Gardikis4, Maria Christopoulou5, George Xilouris5, Lampros Argyriou6, Antonia Karamatskou7, Nikolaos Papadakis6, Emmanouil Kalotychos8, Nikolaos Chatzivasileiadis8, Dimosthenis Masouros1 and Dimitrios Soudris1
1National TU Athens, GR; 2University of Patras, GR; 3R&D Department, Space Hellas S.A., GR; 4R&D Department, Space Hellas S.A., GR; 5Institute of Informatics and Telecommunications, NCSR "Demokritos", GR; 6Infili Technologies S.A., GR; 7Infili Technologies S.A., GR; 8UBITECH Ltd., Digital Security & Trusted Computing Group, GR
Abstract
Next generation 6G networks are designed to meet the requirements of modern applications, including the need for higher bandwidth and ultra-low latency services. While these networks show significant potential to fulfill these evolving connectivity needs, they also bring new challenges, particularly in the area of security. Meanwhile, ensuring privacy is paramount in 6G network development, demanding robust solutions that follow "privacy-by-design" principles. To address these challenges, the PRIVATEER project strengthens existing security mechanisms by introducing privacy-centric enablers tailored for 6G networks. This work evaluates key enablers within PRIVATEER, focusing on the development and acceleration of AI-driven anomaly detection models, as well as attestation mechanisms for both hardware accelerators and containerized applications.

SD04 Special Day on Emerging Computing Paradigms

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Session chair:
John Paul Strachan, Forschungszentrum Juelich GmbH, DE

Time Label Presentation Title
Authors
11:00 CEST SD04.1 HOW TO BUILD QUANTUM COMPUTERS AND HOW TO USE THEM
Presenter:
Tommaso Calarco, University of Cologne, DE
Author:
Tommaso Calarco, University of Cologne, DE
Abstract
.
11:22 CEST SD04.2 CELLULAR AND DEVELOPMENTAL PATHWAYS TO MACHINE INTELLIGENCE
Presenter:
Sebastian Risi, IT Copenhagen, DK
Author:
Sebastian Risi, IT Copenhagen, DK
Abstract
.
11:45 CEST SD04.3 TOWARDS SCALABLE PROBABILISTIC COMPUTERS FOR BINARY OPTIMIZATION AND BEYOND
Presenter:
Corentin Delacour, University of California, Santa Barbara, US
Author:
Corentin Delacour, University of California, Santa Barbara, US
Abstract
.
12:07 CEST SD04.4 NEUROMORPHIC COMPUTING AT CLOUD LEVEL
Presenter:
Christian Mayr, TU Dresden, DE
Author:
Christian Mayr, TU Dresden, DE
Abstract
.

TS20 Physical analysis and design

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS20.1 MEGAROUTE: UNIVERSAL AUTOMATED LARGE-SCALE PCB ROUTING METHOD WITH ADAPTIVE STEP-SIZE SEARCH
Speaker:
Haiyun Li, Tsinghua University, CN
Authors:
Haiyun Li1 and Jixin Zhang2
1School of Computer Science, Hubei University of Technology, Wuhan, China; Shenzhen International Graduate School, Tsinghua University, Shenzhen, China, CN; 2Hubei University of Technology, CN
Abstract
The automation of very large-scale PCB routing has long been an unresolved problem in industry due to the wide variety of electronic components and complex design rules. Existing automated PCB routing methods are primarily designed for a single component type (e.g., BGA, BTB) or for simple, small-scale PCBs, and often fail to meet the industry requirements for large-scale PCBs. The biggest challenge is to ensure nearly 100% routability and DRC compliance while achieving high efficiency for large-scale PCBs with various components. To address this challenge, we propose MegaRoute, a precise, efficient, and universal PCB routing method that surpasses the routability and DRC compliance of existing methods, including commercial tools, for PCBs with thousands of nets. MegaRoute introduces an adaptive step-size search algorithm that adjusts exploration steps based on design rules and surrounding obstacles, improving both routability and efficiency. We incorporate shape-based obstacle detection for strict DRC compliance and use routing optimization techniques to enhance routability. We conduct extensive experiments on hundreds of real-world PCBs, including mainboard PCBs with thousands of nets. The results show that MegaRoute achieves over 98% routability across all PCBs with DRC-free results, significantly outperforming state-of-the-art methods and mainstream commercial tools.
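The abstract does not disclose MegaRoute's exact step-size policy; the Python sketch below only illustrates the general idea of an adaptive step: advance coarsely through open regions and fall back to the design-rule granularity near obstacles. The specific rule and the parameter names are assumptions for illustration.

# Illustrative adaptive step-size selection for grid-based routing search.
# The clearance-proportional rule here is an assumption, not MegaRoute's policy.

def adaptive_step(dist_to_obstacle: float, min_spacing: float, max_step: float) -> float:
    """Take large steps in open regions, fine steps when close to obstacles."""
    # Never step so far that a spacing violation could be skipped over.
    step = dist_to_obstacle - min_spacing
    return max(min_spacing, min(step, max_step))

print(adaptive_step(dist_to_obstacle=500.0, min_spacing=10.0, max_step=200.0))  # 200.0 (open area)
print(adaptive_step(dist_to_obstacle=18.0, min_spacing=10.0, max_step=200.0))   # 10.0 (near obstacle)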
11:05 CEST TS20.2 TIMING-DRIVEN GLOBAL PLACEMENT WITH HYBRID HEURISTICS AND NADAM-BASED NET WEIGHTING
Speaker:
Linhao Lu, Southwest University of Science and Technology, CN
Authors:
Linhao Lu, Wenxin Yu, Hongwei Tian, Chenjin Li, Xinmiao Li, Zhaoqi Fu and Zhengjie Zhao, Southwest University of Science and Technology, CN
Abstract
Timing optimization is critical to the entire design flow of very-large-scale integration (VLSI) circuits, and global placement is pivotal in achieving timing closure. However, most global placement algorithms focus on optimizing wirelength rather than timing. To address this gap, we propose a timing-driven global placement algorithm that utilizes a Nadam-based net-weighting strategy, together with a hybrid heuristic approach for adaptive dynamic adjustment of net weights. Experimental results on the ICCAD 2015 contest benchmarks show that, compared to RePlAce, our algorithm significantly improves WNS and TNS by 40.7% and 56.5%, respectively.
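For readers unfamiliar with Nadam-style net weighting, the sketch below applies a generic Nadam update to per-net weights driven by negative slack. The gradient definition, hyperparameters, and the paper's hybrid-heuristic adjustment are not reproduced; everything here is an illustrative assumption.

import numpy as np

# Generic Nadam step applied to per-net weights: nets with worse (more negative)
# slack receive heavier weights. Gradient definition and hyperparameters are
# assumptions for illustration only.

def nadam_net_weights(weights, slack, step, m, v, lr=0.05,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    grad = -np.minimum(slack, 0.0)                 # only violating nets push weights up
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    # Nesterov correction: look one step ahead along the current gradient.
    update = lr * (beta1 * m_hat + (1 - beta1) * grad / (1 - beta1 ** step)) / (np.sqrt(v_hat) + eps)
    return weights + update, m, v

weights, m, v = np.ones(4), np.zeros(4), np.zeros(4)
slack = np.array([0.2, -0.1, -0.5, 0.0])           # ns; negative = timing violation
weights, m, v = nadam_net_weights(weights, slack, step=1, m=m, v=v)
print(weights)                                      # violating nets end up with larger weights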
11:10 CEST TS20.3 IR-FUSION: A FUSION FRAMEWORK FOR STATIC IR DROP ANALYSIS COMBINING NUMERICAL SOLUTION AND MACHINE LEARNING
Speaker:
Feng Guo, Beijing University of Posts and Telecommunications, CN
Authors:
Feng Guo1, Jianwang Zhai1, Jingyu Jia1, Jiawei Liu1, Kang Zhao1, Bei Yu2 and Chuan Shi1
1Beijing University of Posts and Telecommunications, CN; 2The Chinese University of Hong Kong, HK
Abstract
IR drop analysis for on-chip power grids (PGs) is vital but computationally challenging due to the rapid growth in integrated circuit (IC) scale. Traditional numerical methods employed by current EDA software are accurate but extremely time-consuming. To achieve rapid analysis of IR drop, various machine learning (ML) methods have been introduced to address the inefficiency of numerical methods. However, issues of interpretability and scalability have limited their practical application. In this work, we propose IR-Fusion, which combines numerical methods with ML to achieve a trade-off and complementarity between accuracy and efficiency in static IR drop analysis. Specifically, the numerical method is used to obtain rough solutions, and ML models are utilized to further improve accuracy. In our framework, an efficient numerical solver, AMG-PCG, is applied to obtain rough numerical solutions. Then, based on the numerical solution, a fusion of hierarchical numerical-structural information representing the multilayer structure of the PG is employed, and an Inception Attention U-Net model is designed to capture details and the interaction of features at different scales. To cope with the limitations and diversity of PG designs, an augmented curriculum learning strategy is applied in the training phase. Evaluation of IR-Fusion shows that its accuracy is significantly better than previous ML-based methods, while requiring considerably fewer solver iterations than numerical methods to achieve the same accuracy.
11:15 CEST TS20.4 TIMING-DRIVEN DETAILED PLACEMENT WITH UNSUPERVISED GRAPH LEARNING
Speaker:
Dhoui Lim, Ulsan National Institute of Science and Technology, KR
Authors:
Dhoui Lim1 and Heechun Park2
1Kookmin University, School of Electrical Engineering, KR; 2Ulsan National Institute of Science and Technology (UNIST), KR
Abstract
Detailed placement is a crucial stage in VLSI design that starts from the global placement result to determine the final legal locations of each cell through fine-grained optimization. Traditional detailed placement methods focus on minimizing the half-perimeter wire length (HPWL), as in global placement. However, incorporating timing-driven placement becomes essential with the increasing complexity of VLSI designs and tighter performance constraints. In this paper, we propose a timing-driven detailed placement framework that leverages unsupervised graph learning techniques. Specifically, we integrate timing-related metrics into the objective function for detailed placement and formulate it as the loss function of a graph neural network (GNN) model. The loss function includes overlap, legality, and timing-related arc lengths, with weights determined via Bayesian optimization. Experimental results show that our framework achieves comparable or improved HPWL while significantly reducing total negative slack (TNS) by 5.5% compared to existing methods.
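To make the composite objective concrete, the sketch below shows one plausible shape of such a loss: a criticality-weighted timing-arc length plus overlap and row-legality penalties. The per-term forms and the weights (which the paper tunes with Bayesian optimization) are assumptions for illustration; the actual loss driving the GNN may differ.

import numpy as np

# One plausible shape of a timing-aware detailed-placement loss. All term
# definitions and weights here are illustrative assumptions.

def placement_loss(xy, arcs, arc_crit, cell_w, cell_h, row_pitch,
                   w_timing=1.0, w_overlap=10.0, w_legal=1.0):
    x, y = xy[:, 0], xy[:, 1]
    i, j = arcs[:, 0], arcs[:, 1]
    # Timing term: criticality-weighted Manhattan length of timing arcs i -> j.
    timing = np.sum(arc_crit * (np.abs(x[i] - x[j]) + np.abs(y[i] - y[j])))
    # Overlap term: total pairwise rectangle overlap area between cells.
    dx = np.maximum(0.0, cell_w - np.abs(x[:, None] - x[None, :]))
    dy = np.maximum(0.0, cell_h - np.abs(y[:, None] - y[None, :]))
    overlap = (np.sum(dx * dy) - xy.shape[0] * cell_w * cell_h) / 2.0
    # Legality term: distance of each cell to its nearest placement row.
    legal = np.sum(np.abs(y - np.round(y / row_pitch) * row_pitch))
    return w_timing * timing + w_overlap * overlap + w_legal * legal

xy = np.array([[0.0, 0.0], [1.5, 0.2], [4.0, 2.1]])
arcs = np.array([[0, 1], [1, 2]])
print(placement_loss(xy, arcs, arc_crit=np.array([1.0, 0.3]),
                     cell_w=1.0, cell_h=2.0, row_pitch=2.0))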
11:20 CEST TS20.5 EFFICIENT AND EFFECTIVE MACRO PLACEMENT FOR VERY LARGE SCALE DESIGNS USING RL AND MCTS INTEGRATION
Speaker:
Zong-Ze Lee, National Cheng Kung University, TW
Authors:
Jai-Ming Lin1, Zong-Ze Lee1 and Nan-Chu Lin2
1Department of Electrical Engineering, National Cheng Kung University, TW; 2National Cheng Kung University, TW
Abstract
Macro placement plays a critical role in modern designs. With the rise of artificial intelligence, some researchers have turned to reinforcement learning (RL) techniques to handle this problem. However, these approaches usually require substantial computing resources and runtime for training, making them impractical for very large-scale integration (VLSI) designs. To address these challenges, this paper proposes an effective placer based on the Monte Carlo Tree Search (MCTS) algorithm, guided by a pre-trained RL agent. To reduce the complexities of RL and MCTS, we transform the macro placement problem into a macro group allocation problem. Additionally, we propose a new reward function to facilitate training convergence in RL. Moreover, to reduce runtime without affecting placement quality, we use the pre-training result to directly evaluate the placement quality in MCTS for non-terminal nodes, significantly reducing the number of placement runs required. Experiments show that our MCTS-based placer can achieve high-quality results even in the early stages of RL training. Moreover, our method outperforms state-of-the-art placers.
11:25 CEST TS20.6 DAMIL-DCIM: A DIGITAL CIM LAYOUT SYNTHESIS FRAMEWORK WITH DATAFLOW-AWARE FLOORPLAN AND MILP-BASED DETAILED PLACEMENT
Speaker:
Chuyu Wang, Fudan University, CN
Authors:
Chuyu Wang, Ke Hu, Fan Yang, Keren Zhu and Xuan Zeng, Fudan University, CN
Abstract
Digital computing-in-memory (DCIM) systems integrate complex digital logic with parasitic-sensitive bitcell arrays. Conventional physical design strategies degrade DCIM performance due to a lack of dataflow regularity and excessive wirelength. As a result, current DCIM design often relies on manual layout, which is time-consuming and a bottleneck in the design cycle. Existing layout synthesis frameworks for DCIM often mimic the manual approach and employ a template-based method for DCIM placement. However, overly constrained templates lead to excessive core area, resulting in high costs in practice. In this work, we introduce DAMIL-DCIM, a novel placement framework that bridges template-based techniques with optimization-based placement methods. DAMIL-DCIM utilizes a global dataflow-aware floorplan inspired by template methods and further optimizes the layout using MILP (Mixed-Integer Linear Programming)-based detailed placement. The combination of global floorplanning and placement optimization reduces total wirelength while maintaining dataflow regularity, resulting in lower parasitics and enhanced performance. Experimental results show that, on a practical 28 nm DCIM circuit, our approach improves frequency by 25.2% and reduces power consumption by 19.6% compared to Cadence Innovus, while maintaining the same core area.
11:30 CEST TS20.7 BI-LEVEL OPTIMIZATION ACCELERATED DRC-AWARE PHYSICAL DESIGN AUTOMATION FOR PHOTONIC DEVICES
Speaker:
Hao Chen, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Hao Chen1, Yuzhe Ma1 and Yeyu Tong2
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Photonic integrated circuit (PIC) design has been challenged by the complex physics behind various integrated photonic devices. Inverse design offers an effective design automation solution for obtaining high-performance and compact photonic devices using computational algorithms and electromagnetic (EM) simulations. However, the challenge lies in transforming the fabrication-infeasible device geometries obtained from computational algorithms into reliable yet optimal physical designs. Incorporating fabrication constraints into the optimization iterations can extend running time and lead to performance compromises. In this work, we propose a novel DRC-aware photonic inverse design framework, leveraging bi-level optimization to enable end-to-end gradient-based device optimization. Our method guarantees that all intermediate devices on the optimization trajectory adhere to fabrication requirements and rules. The proposed workflow eliminates the need for a binarization process and fabrication constraint adaptation, thus enabling a fast and efficient search for high-performance and reliable integrated photonic devices. Experimental results demonstrate the benefits of our proposed method, including improved device performance and reduced EM simulations and running time.
11:35 CEST TS20.8 GTN-CELL: EFFICIENT STANDARD CELL CHARACTERIZATION USING GRAPH TRANSFORMER NETWORK
Speaker:
Lihao Liu, State Key Lab of Integrated Chips and Systems, School of Microelectronics, Fudan University, CN
Authors:
Lihao Liu, Beisi Lu, Yunhui Li, Li Shang and Fan Yang, Fudan University, CN
Abstract
Lookup table (LUT)-based standard cell characterization libraries are crucial to accurate static timing analysis (STA). However, with the continuous scaling of technology nodes and the increasing complexity of circuit designs, the traditional non-linear delay model (NLDM) is progressively unable to meet the required accuracy for cell modeling. The current source model (CSM) offers a more precise characterization of cells at advanced nodes and is able to handle arbitrary electrical waveforms. However, the CSM is highly time-consuming because it requires extensive transistor-level simulations, posing severe challenges to efficient standard cell library design. This work presents GTN-Cell, an efficient graph transformer network (GTN)-based method for library-compatible, LUT-based CSM waveform prediction in standard cell characterization. GTN-Cell represents the transistor-level structures of standard cells as graphs, learning the local structural information of each cell. By incorporating the transformer encoder into the model and embedding path-related positional encodings, GTN-Cell captures the global relationships between distant nodes within each cell. Compared with HSPICE, GTN-Cell achieves an average error of 2.27% on predicted voltage waveforms across different standard cells and timing arcs while reducing the number of simulations by 70%.
11:40 CEST TS20.9 WIRE-BONDING FINGER PLACEMENT FOR FBGA SUBSTRATE LAYOUT DESIGN WITH FINGER ORIENTATION CONSIDERATION
Speaker:
Yu-En Lin, National Taiwan University of Science and Technology, Department of Computer Science and Information Engineering, TW
Authors:
Yu-En Lin and Yi-Yu Liu, National Taiwan University of Science and Technology, TW
Abstract
Wire bonding is a mature packaging technique that enables chip pins to transmit signals to bonding fingers on the substrate through bonding wires. This commodity technology is also essential in supporting the rapid development of system-in-package and heterogeneous integration technologies. However, automation tools are relatively scarce compared to those for other packaging techniques, resulting in tremendous manual design time and engineering effort due to numerous wire-bonding design constraints. This paper addresses the finger placement problem and serves as the first work to consider the orientation constraint of fingers. The finger placement flow is divided into three stages. First, an integer linear programming (ILP) formulation is developed to allocate a finger row to each net. After that, we utilize mixed-integer quadratic programming (MIQP) to place the bonding fingers while considering the wire crossing constraint. Finally, the locations of the bonding fingers are refined by considering both the bonding finger orientation angle and the finger spacing constraints. The final layouts generated by our integrated finger placement and substrate routing framework outperform manual designs in terms of design time, total wirelength, and routing completion rate.
11:45 CEST TS20.10 A PARALLEL FLOATING RANDOM WALK SOLVER FOR REPRODUCIBLE AND RELIABLE CAPACITANCE EXTRACTION
Speaker:
Jiechen Huang, Dept. Computer Science & Tech., Tsinghua University, CN
Authors:
Jiechen Huang1, Shuailong Liu2 and Wenjian Yu1
1Tsinghua University, CN; 2Exceeda Inc., CN
Abstract
The floating random walk (FRW) method is a popular and promising tool for capacitance extraction, but its stochastic nature leads to critical limitations in reproducibility and physics-related reliability. In this work, we present FRW-RR, a parallel FRW solver with enhancements for Reproducible and Reliable capacitance extraction. First, we propose a novel parallel FRW scheme that ensures reproducible results, regardless of the degree of parallelism (DOP) or machine used. We further optimize its parallel efficiency and enhance its numerical stability. Then, to guarantee the physical properties of capacitances and reliability for downstream tasks, we propose a regularization technique based on constrained multi-parameter estimation to postprocess the FRW results. Experiments on actual IC structures demonstrate that FRW-RR ensures DOP-independent reproducibility (to at least 12 significant decimal digits) and physics-related reliability with negligible overhead. It has remarkable advantages over existing FRW solvers, including the one in [1].
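The claim of DOP-independent reproducibility rests on decoupling each walk's randomness from how walks are scheduled. The Python sketch below shows the generic pattern (seed each walk from its global index only); it is not FRW-RR's actual scheme, and the "walk" here is a trivial stand-in for a real floating random walk.

import random
from concurrent.futures import ThreadPoolExecutor

# Generic pattern for DOP-independent Monte Carlo: each walk's randomness
# depends only on its global index, so the estimate is identical no matter how
# walks are split across workers. Not FRW-RR's actual algorithm.

def one_walk(walk_id: int) -> float:
    rng = random.Random(12345 ^ walk_id)   # seed derived from the walk index only
    return rng.random()                    # toy stand-in for a floating random walk

def estimate(n_walks: int, workers: int) -> float:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        samples = list(pool.map(one_walk, range(n_walks)))
    return sum(samples) / n_walks          # same summation order for any worker count

print(estimate(10_000, workers=1) == estimate(10_000, workers=8))   # True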
11:50 CEST TS20.11 A COMPREHENSIVE INDUCTANCE-AWARE MODELING APPROACH TO POWER DISTRIBUTION NETWORK IN HETEROGENEOUS 3D INTEGRATED CIRCUITS
Speaker:
Yuanqing Cheng, Beihang University, CN
Authors:
Quansen Wang1, Vasilis Pavlidis2 and Yuanqing Cheng1
1Beihang University, CN; 2Aristotle University of Thessaloniki, GR
Abstract
Heterogeneous 3D integration technology is a cost-effective and high-performance alternative to planar integrated circuits (ICs). In this paper, we propose an on-chip power distribution network (PDN) modeling technique for heterogeneous 3D-ICs (H3D-ICs), which explicitly takes the effects of on-chip inductance into account. The proposed model facilitates efficient transient and AC simulations with integrated inductive effects, enabling accurate noise characterization at high frequencies and facilitating the exploration of early-stage PDN design. The model is validated via HSPICE simulations, demonstrating a maximum error below 1% and achieving average speedups of 1.5x in transient and 8.5x in AC simulations.

TS21 Design methodologies for machine learning architectures

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS21.1 SPNERF: MEMORY EFFICIENT SPARSE VOLUMETRIC NEURAL RENDERING ACCELERATOR FOR EDGE DEVICES
Speaker:
Yipu Zhang, The Hong Kong University of Science and Technology, HK
Authors:
Yipu Zhang1, Jiawei Liang1, Jian Peng1, Jiang Xu2 and Wei Zhang1
1The Hong Kong University of Science and Technology, HK; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Neural rendering has gained prominence for its high-quality output, which is crucial for AR/VR applications. However, its large voxel grid data size and irregular access patterns challenge real-time processing on edge devices. While previous works have focused on improving data locality, they have not adequately addressed the issue of large voxel grid sizes, which necessitate frequent off-chip memory access and substantial on-chip memory. This paper introduces SpNeRF, a software-hardware co-design solution tailored for sparse volumetric neural rendering. We first identify memory-bound rendering inefficiencies and analyze the inherent sparsity in the voxel grid data of neural rendering. To enhance efficiency, we propose novel preprocessing and online decoding steps, reducing memory size for the voxel grid. The preprocessing step employs hash mapping to support irregular data access while maintaining a minimal memory size. The online decoding step enables efficient on-chip sparse voxel grid processing, incorporating bitmap masking to mitigate PSNR loss caused by hash collisions. To further optimize performance, we design a dedicated hardware architecture supporting our sparse voxel grid processing technique. Experimental results demonstrate that SpNeRF achieves an average 21.07× reduction in memory size while maintaining comparable PSNR levels. When benchmarked against Jetson XNX, Jetson ONX, RT-NeRF.Edge and NeuRex.Edge, our design achieves speedups of 95.1×, 63.5×, 1.5× and 10.3×, and improves energy efficiency by 625.6×, 529.1×, 4×, and 4.4×, respectively.
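The sketch below illustrates, in simplified form, how a hash-mapped sparse voxel table combined with an occupancy bitmap can serve lookups while masking hash collisions onto empty voxels. The hash function, table size, and masking policy are assumptions for illustration and are not SpNeRF's exact preprocessing and decoding scheme.

import numpy as np

# Hashed sparse-voxel lookup with bitmap masking (illustrative assumptions only).

GRID, TABLE_SIZE = 64, 4096

def voxel_hash(ix, iy, iz):
    return ((ix * 1) ^ (iy * 2654435761) ^ (iz * 805459861)) % TABLE_SIZE

occupancy = np.zeros((GRID, GRID, GRID), dtype=bool)     # bitmap of occupied voxels
table = np.zeros((TABLE_SIZE, 4), dtype=np.float32)      # 4-channel voxel features
for (x, y, z) in [(3, 7, 9), (10, 10, 10)]:              # toy occupied voxels
    occupancy[x, y, z] = True
    table[voxel_hash(x, y, z)] = np.random.rand(4)

def lookup(ix, iy, iz):
    """Return the voxel feature; the bitmap masks out empty voxels even if their hash collides."""
    if not occupancy[ix, iy, iz]:
        return np.zeros(4, dtype=np.float32)
    return table[voxel_hash(ix, iy, iz)]

print(lookup(3, 7, 9))    # stored feature
print(lookup(3, 7, 8))    # zeros: empty voxel, regardless of hash collisions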
11:05 CEST TS21.2 SBQ: EXPLOITING SIGNIFICANT BITS FOR EFFICIENT AND ACCURATE POST-TRAINING DNN QUANTIZATION
Speaker:
Jiayao Ling, Shanghai Jiao Tong University, CN
Authors:
Jiayao Ling1, Gang Li2, Qinghao Hu2, Xiaolong Lin1, Cheng Gu1, Jian Cheng3 and Xiaoyao Liang1
1Shanghai Jiao Tong University, CN; 2Institute of Computing Technology, Chinese Academy of Sciences, CN; 3Institute of Automation, CN
Abstract
Post-Training Quantization (PTQ) is an effective technique for deep neural network acceleration. However, as the bit-width decreases to 4 bits and below, PTQ faces significant challenges in preserving accuracy, especially for attention-based models like LLMs. The main issue lies in the considerable clipping and rounding errors induced by the limited number of quantization levels and narrow data range in conventional low-precision quantization. In this paper, we present an efficient and accurate PTQ method that targets 4 bits and below through algorithm and architecture co-design. Our key idea is to dynamically extract a small portion of significant bit terms from high-precision operands to perform low-precision multiplications under a given computational budget. Specifically, we propose Significant-Bit Quantization (SBQ). It exploits a product-aware method to dynamically identify significant terms and an error-compensated computation scheme to minimize compute errors. We present a dedicated inference engine to unleash the power of SBQ. Experiments on CNNs, ViTs, and LLMs reveal that SBQ consistently outperforms prior PTQ methods under 2~4-bit quantization. We also compare the proposed inference engine with state-of-the-art bit-operation-based quantization architectures TQ and Sibia. Results show that SBQ achieves the highest area and energy efficiency.
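As a minimal illustration of the significant-bit idea (not SBQ's product-aware selection or error compensation), the sketch below keeps only the k most-significant set bits of one operand so that the multiplication reduces to k shift-add terms.

# Keep the k most-significant set bits of an operand and multiply with them.
# Simplified stand-in for the significant-bit idea; SBQ's actual term selection
# and error compensation are more involved.

def top_k_bit_terms(value: int, k: int) -> int:
    """Approximate |value| by its k most-significant power-of-two terms."""
    mag, terms, kept = abs(value), 0, 0
    for bit in range(mag.bit_length() - 1, -1, -1):
        if mag & (1 << bit):
            terms |= 1 << bit
            kept += 1
            if kept == k:
                break
    return -terms if value < 0 else terms

def approx_mul(a: int, b: int, k: int = 2) -> int:
    return top_k_bit_terms(a, k) * b         # only k shift-add terms are needed

a, b = 181, 77                                # 181 = 0b10110101
print(a * b, approx_mul(a, b, k=2))           # exact result vs. 2-term approximation (160 * 77)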
11:10 CEST TS21.3 AIRCHITECT V2: LEARNING THE HARDWARE ACCELERATOR DESIGN SPACE THROUGH UNIFIED REPRESENTATIONS
Speaker:
Akshat Ramachandran, Georgia Tech, US
Authors:
Akshat Ramachandran1, Jamin Seo1, Yu-Chuan Chuang2, Anirudh Itagi1 and Tushar Krishna1
1Georgia Tech, US; 2National Taiwan University, TW
Abstract
Design space exploration (DSE) plays a crucial role in enabling custom hardware architectures, particularly for emerging applications like AI, where optimized and specialized designs are essential. With the growing complexity of deep neural networks (DNNs) and the introduction of advanced foundational models (FMs), the design space for DNN accelerators is expanding at an exponential rate. Additionally, this space is highly non-uniform and non-convex, making it increasingly difficult to navigate and optimize. Traditional DSE techniques rely on search-based methods, which involve iterative sampling of the design space to find the optimal solution. However, this process is both time-consuming and often fails to converge to the global optima for such design spaces. Recently, AIrchitect v1, the first attempt to address the limitations of search-based techniques, transformed DSE into a constant-time classification problem using recommendation networks. In this work, we propose AIrchitect v2, a more accurate and generalizable learning-based DSE technique applicable to large-scale design spaces that overcomes the shortcomings of earlier approaches. Specifically, we devise an encoder-decoder transformer model that (a) encodes the complex design space into a uniform intermediate representation using contrastive learning and (b) leverages a novel unified representation blending the advantages of classification and regression to effectively explore the large DSE space without sacrificing accuracy. Experimental results evaluated on 10^5 real DNN workloads demonstrate that, on average, AIrchitect v2 outperforms existing techniques by 15% in identifying optimal design points. Furthermore, to demonstrate the generalizability of our method, we evaluate performance on unseen model workloads (LLMs) and attain a 1.7x improvement in inference latency on the identified hardware architecture. Code and dataset are available at: https://github.com/maestro-project/AIrchitect-v2.
11:15 CEST TS21.4 ZEBRA: LEVERAGING DIAGONAL ATTENTION PATTERN FOR VISION TRANSFORMER ACCELERATOR
Speaker:
Sukhyun Han, Sungkyunkwan University, KR
Authors:
Sukhyun Han, Seongwook Kim, Gwangeun Byeon, Jihun Yoon and Seokin Hong, Sungkyunkwan University, KR
Abstract
Vision Transformers (ViTs) have achieved remarkable performance in computer vision, but their computational complexity and challenges in optimizing memory bandwidth limit hardware acceleration. A major bottleneck lies in the self-attention mechanism, which leads to excessive data movement and unnecessary computations despite high input sparsity and low computational demands. To address this challenge, existing transformer accelerators have leveraged sparsity in attention maps. However, their performance gains are limited due to low hardware utilization caused by the irregular distribution of non-zero values in the sparse attention maps. Self-attention often exhibits strong diagonal patterns in the attention map, as the diagonal elements tend to have higher values than others. To exploit this, we introduce Zebra, a hardware accelerator framework optimized for diagonal attention patterns. A core component of Zebra is the Striped Diagonal (SD) pruning technique, which prunes the attention map by preserving only the diagonal elements at runtime. This reduces computational load without requiring offline pre-computation or causing significant accuracy loss. Zebra features a reconfigurable accelerator architecture that supports optimized matrix multiplication method, called Striped Diagonal Matrix Multiplication (SDMM), which computes only the diagonal elements of matrices. With this novel method, Zebra addresses low hardware utilization, a key barrier to leveraging the diagonal patterns. Experimental results demonstrate that Zebra achieves a 57x speedup over a CPU and 1.7x over the state-of-the-art ViT accelerator with similar inference accuracy.
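The sketch below shows the arithmetic effect of striped diagonal pruning: only attention scores within a fixed band of the diagonal are computed, and the rest are masked out. The band width and the dense reference formulation are assumptions for illustration; Zebra's SDMM maps this stripe onto a reconfigurable array rather than masking a dense product.

import numpy as np

# Attention restricted to a diagonal stripe of width 2*band+1 (illustrative only).

def striped_diagonal_attention(Q, K, V, band=2):
    n, d = Q.shape
    scores = np.full((n, n), -np.inf)
    for i in range(n):
        lo, hi = max(0, i - band), min(n, i + band + 1)
        scores[i, lo:hi] = Q[i] @ K[lo:hi].T / np.sqrt(d)   # compute only the stripe
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    return probs @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
print(striped_diagonal_attention(Q, K, V, band=2).shape)    # (8, 16)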
11:20 CEST TS21.5 PUSHING UP TO THE LIMIT OF MEMORY BANDWIDTH AND CAPACITY UTILIZATION FOR EFFICIENT LLM DECODING ON EMBEDDED FPGA
Speaker:
Jindong Li, Institute of Computing Technology, Chinese Academy of Sciences, CN
Authors:
Jindong Li, Tenglong Li, Guobin Shen, Dongcheng Zhao, Qian Zhang and Yi Zeng, Institute of Computing Technology, Chinese Academy of Sciences, CN
Abstract
The extremely high computational and storage demands of large language models have excluded most edge devices, which were widely used for efficient machine learning, from being viable options. A typical edge device usually only has 4GB of memory capacity and a bandwidth of less than 20GB/s, while a large language model quantized to 4-bit precision with 7B parameters already requires 3.5GB of capacity, and its decoding process is purely bandwidth-bound. In this paper, we aim to explore these limits by proposing a hardware accelerator for large language model (LLM) inference on the Zynq-based KV260 platform, equipped with 4GB of 64-bit 2400Mbps DDR4 memory. We successfully deploy a LLaMA2-7B model, achieving a decoding speed of around 5 token/s, utilizing 93.3% of the memory capacity and reaching 85% decoding speed of the theoretical memory bandwidth limit. To fully reserve the memory capacity for model weights and key-value cache, we develop the system in a bare-metal environment without an operating system. To fully reserve the bandwidth for model weight transfers, we implement a customized dataflow with an operator fusion pipeline and propose a data arrangement format that can maximize the data transaction efficiency. This research marks the first attempt to deploy a 7B level LLM on a standalone embedded field programmable gate array (FPGA) device. It provides key insights into efficient LLM inference on embedded FPGA devices and provides guidelines for future architecture design.
11:25 CEST TS21.6 LEVERAGING COMPUTE-IN-MEMORY FOR EFFICIENT GENERATIVE MODEL INFERENCE IN TPUS
Speaker:
Zhantong Zhu, Peking University, CN
Authors:
Zhantong Zhu, Hongou Li, Wenjie Ren, Meng Wu, Le Ye, Ru Huang and Tianyu Jia, Peking University, CN
Abstract
With the rapid advent of generative models, efficiently deploying these models on specialized hardware has become critical. Tensor Processing Units (TPUs) are designed to accelerate AI workloads, but their high power consumption necessitates innovations for improving efficiency. Compute-in-memory (CIM) has emerged as a promising paradigm with superior area and energy efficiency. In this work, we present a TPU architecture that integrates digital CIM to replace conventional digital systolic arrays in matrix multiply units (MXUs). We first establish a CIM-based TPU architecture model and simulator to evaluate the benefits of CIM for diverse generative model inference. Building upon the observed design insights, we further explore various CIM-based TPU architectural design choices. Up to 44.2% and 33.8% performance improvement for large language model and diffusion transformer inference, and 27.3x reduction in MXU energy consumption can be achieved with different design choices, compared to the baseline TPUv4i architecture.
11:30 CEST TS21.7 SPARSEINFER: TRAINING-FREE PREDICTION OF ACTIVATION SPARSITY FOR FAST LLM INFERENCE
Speaker:
Jiho Shin, University of Seoul, KR
Authors:
Jiho Shin1, Hoeseok Yang2 and Youngmin Yi3
1University of Seoul, KR; 2Santa Clara University, US; 3Sogang University, KR
Abstract
Leveraging sparsity is crucial for optimizing large language model (LLM) inference; however, modern LLMs employing SiLU as their activation function exhibit minimal activation sparsity. Recent research has proposed replacing SiLU with ReLU to induce significant activation sparsity and showed, through fine-tuning, no downstream task accuracy degradation. However, taking full advantage of this sparsity required training a predictor to estimate it. In this paper, we introduce SparseInfer, a simple, lightweight, and training-free predictor for the activation sparsity of ReLU-fied LLMs, in which activation sparsity is predicted by comparing only the sign bits of inputs and weights. To compensate for possible prediction inaccuracy, the predictor's conservativeness can be adaptively tuned, which also serves as a control knob for optimizing LLM inference. The proposed method achieves approximately 21% faster inference than the state-of-the-art, with a negligible accuracy loss of within one percentage point.
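The prediction rule stated in the abstract (compare only the sign bits of inputs and weights) can be illustrated as below: an output neuron whose input/weight signs mostly disagree is likely negative before ReLU and therefore zero after it. The 0.5 threshold stands in for the adaptive conservativeness knob and is an assumption, not the paper's tuned value.

import numpy as np

# Sign-bit-based prediction of post-ReLU zeros (illustrative threshold).

def predict_zero_activations(x, W, threshold=0.5):
    """Boolean mask of output neurons predicted to be zero after ReLU."""
    sign_agree = (x[None, :] >= 0) == (W >= 0)     # per (output, input) sign match
    agree_frac = sign_agree.mean(axis=1)
    return agree_frac < threshold                   # few agreements -> likely negative pre-activation

rng = np.random.default_rng(1)
x = rng.standard_normal(256)
W = rng.standard_normal((64, 256))                  # 64 output neurons
pred_zero = predict_zero_activations(x, W)
true_zero = np.maximum(W @ x, 0.0) == 0.0
print(f"predicted sparsity: {pred_zero.mean():.2f}, "
      f"match with true zeros: {(pred_zero == true_zero).mean():.2f}")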
11:35 CEST TS21.8 LOW-RANK COMPRESSION FOR IMC ARRAYS
Speaker:
Kang Eun Jeon, Sungkyunkwan University, KR
Authors:
Kang Eun Jeon, Johnny Rhe and Jong Hwan Ko, Sungkyunkwan University, KR
Abstract
In this study, we address the challenge of low-rank model compression in the context of in-memory computing (IMC) architectures. Traditional pruning approaches, while effective in model size reduction, necessitate additional peripheral circuitry to manage complex dataflows and mitigate dislocation issues, leading to increased area and energy overheads, especially when model sparsity does not meet a specific threshold. To circumvent these drawbacks, we propose leveraging low-rank compression techniques, which, unlike pruning, streamline the dataflow and seamlessly integrate with IMC architectures. However, low-rank compression presents its own set of challenges, notably suboptimal IMC array utilization and compromised accuracy compared to traditional pruning methods. To address these issues, we introduce a novel approach employing a shift and duplicate kernel (SDK) mapping technique, which exploits idle IMC columns for parallel processing, and group low-rank convolution, which mitigates the information imbalance in the decomposed matrices. Our experimental results, using ResNet-20 and Wide ResNet16-4 networks on CIFAR-10 and CIFAR-100 datasets, demonstrate that our proposed method not only matches the performance of existing pruning techniques on ResNet-20 but also achieves up to 2.5x speedup and +20.9% accuracy boost on Wide ResNet16-4.
11:40 CEST TS21.9 INTEGER UNIT-BASED OUTLIER-AWARE LLM ACCELERATOR PRESERVING NUMERICAL ACCURACY OF FP-FP GEMM
Speaker:
Jehun Lee, Seoul National University, KR
Authors:
Jehun Lee and Jae-Joon Kim, Seoul National University, KR
Abstract
The proliferation of large language models (LLMs) has significantly heightened the importance of quantization to alleviate the computational burden given the surge in the number of parameters. However, quantization often targets a subset of a LLM and relies on the floating-point (FP) arithmetic for matrix multiplication of specific subsets, leading to performance and energy overhead. Additionally, to compensate for the quality degradation incurred by quantization, retraining methods are frequently employed, demanding significant efforts and resources. This paper proposes OwL-P, an outlier-aware LLM inference accelerator which preserves the numerical accuracy of FP arithmetic while enhancing hardware efficiency with an integer (INT)-based arithmetic unit for general matrix multiplication (GEMM), through the use of a shared exponent and efficient management of outlier data. It also mitigates off-chip bandwidth requirements by employing a compressed number format. The proposed number format leverages outliers and shared exponents to facilitate the compression of both model weights and activations. We evaluate this work across 10 different transformer-based benchmarks, and the results demonstrate that the proposed integer-based LLM accelerator achieves an average 2.70× performance gain and 3.57× energy savings while maintaining the numerical accuracy of the FP arithmetic.
11:45 CEST TS21.10 LEVERAGING HOT DATA IN A MULTI-TENANT ACCELERATOR FOR EFFECTIVE SHARED MEMORY MANAGEMENT
Speaker:
Chunmyung Park, Seoul National University, KR
Authors:
Chunmyung Park, Jicheon Kim, Eunjae Hyun, Xuan Truong Nguyen and Hyuk-Jae Lee, Seoul National University, KR
Abstract
Multi-tenant neural networks (MTNN) have been emerging in various domains. To effectively handle multi-tenant workloads, modern hardware systems typically incorporate multiple compute cores with shared memory systems. While prior works have intensively studied compute- and bandwidth-aware allocation, on-chip memory allocation for MTNN accelerators has not been well studied. This work identifies two key challenges of on-chip memory allocation in MTNN accelerators: on-chip memory shortages, which force data eviction to off-chip memory, and on-chip memory underutilization, where memory remains idle due to coarse-grained allocation. Both issues lead to increased external memory accesses (EMAs), significantly degrading system performance. To address these challenges, we propose HotPot, a novel multi-tenant accelerator with a runtime temperature-aware memory allocator. HotPot prioritizes hot data for global on-chip memory allocation, reducing unnecessary EMAs and optimizing memory utilization. Specifically, HotPot introduces a temperature score that quantifies reuse potential and guides runtime memory allocation decisions. Experimental results demonstrate that HotPot improves system throughput (STP) by up to 1.88× and average normalized turnaround time (ANTT) by 1.52× compared to baseline methods.
11:50 CEST TS21.11 DOTS: DRAM-PIM OPTIMIZATION FOR TALL AND SKINNY GEMM OPERATIONS IN LLM INFERENCE
Speaker:
Gyeonghwan Park, Seoul National University, KR
Authors:
Gyeonghwan Park, Sanghyeok Han, Yoon Byungkuk and Jae-Joon Kim, Seoul National University, KR
Abstract
For large language models (LLMs), increasing token lengths require smaller batch sizes due to the growing memory requirement of KV caching, leading to under-utilization of processing units and a memory-bandwidth bottleneck in NPUs. To address this challenge, we propose DOTS, a new DRAM-PIM architecture that can handle both GEMV and GEMM efficiently, even outperforming NPUs in GEMM operations when batch sizes are small. The proposed DRAM-PIM reduces the power consumption and latency caused by frequent DRAM row activation switching in conventional DRAM-PIMs with negligible hardware overhead. Simulation results show that our proposed design achieves throughput improvements of 1.83x, 1.92x, and 1.7x over GPU, NPU, and heterogeneous NPU/PIM systems, respectively, for models as large as or larger than OPT-175B.
11:51 CEST TS21.12 LLM4GV: AN LLM-BASED FLEXIBLE PERFORMANCE-AWARE FRAMEWORK FOR GEMM VERILOG GENERATION
Speaker:
Meiqi Wang, Sun Yat-sen University, CN
Authors:
Dingyang Zou1, Gaoche Zhang1, Kairui Sun2, Wen Zhe3, Meiqi Wang2 and Zhongfeng Wang1
1Nanjing University, CN; 2Sun Yat-sen University, CN; 3Sun Yat-sen University, CN
Abstract
Advancements in AI have increased the demand for specialized AI accelerators, with the design of general matrix multiplication (GEMM) modules being crucial but time-consuming. While large language models (LLMs) show promise for automating GEMM design, challenges arise from GEMM's vast design space and performance requirements. Existing LLM-based frameworks for RTL code generation often lack flexibility and performance awareness. To overcome these challenges, we propose LLM4GV, a multi-agent LLM-based framework that integrates hardware optimization techniques (HOTs) and performance modeling, improving the correctness and performance of the generated code over prior works.

TS22 Design and test of hardware security primitives

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS22.1 USING OFF-SET ONLY FOR CORRUPTING CIRCUIT TO RESIST STRUCTURAL ATTACK IN CAC LOCKING
Speaker:
Hsaing-Chun Cheng, National Tsing Hua University, TW
Authors:
Hsaing-Chun Cheng, RuiJie Wang and TingTing Hwang, National Tsing Hua University, TW
Abstract
Corrupt-and-Correct (CAC) logic locking techniques [1]–[4] are state-of-the-art hardware security techniques designed to protect IC/IP designs from IP piracy, reverse engineering, overproduction, and unauthorized use. Although these techniques are resilient to SAT-based attacks, they remain vulnerable to structural attacks, which exploit structural traces left by the synthesis tool to recover the original design. In this paper, we propose a novel method that uses only the OFF-set to corrupt the circuit. This approach helps the added circuitry merge better with the original circuit, thereby thwarting structural attacks while maintaining resilience to SAT-based attacks. Additionally, we demonstrate that our proposed method incurs less area overhead than previous locking methods in HIID [5]. Compared to SFLL-rem [4], our method achieves comparable area overhead while effectively resisting structural attacks, including Valkyrie [6] and SPI attacks [7].
11:05 CEST TS22.2 RUNTIME SECURITY ANALYSIS OF MONOLITHIC 3D EMBEDDED DRAM WITH OXIDE-CHANNEL TRANSISTOR
Speaker:
Eduardo Ortega, Arizona State University, US
Authors:
Eduardo Ortega1, Jungyoun Kwak2, Shimeng Yu2 and Krishnendu Chakrabarty1
1Arizona State University, US; 2Georgia Tech, US
Abstract
We present the first security and disturbance study of monolithic 3D (M3D) embedded DRAM (eDRAM) with a 2T gain cell using oxide-channel transistors. We explore the Rowhammer/Rowpress vulnerabilities of amorphous indium tungsten oxide (IWO) transistors for eDRAM with standalone 2D integration and memory-on-memory M3D integration. In addition, we examine M3D-specific electrical disturbances from memory-on-logic M3D integration. We evaluate IWO eDRAM's susceptibility to these vulnerabilities and disturbances and discuss the potential impact on M3D integration. We examine physical design and architecture strategies for M3D integration of IWO eDRAM and provide systematic recommendations to inform security strategies for M3D integration and the security of IWO eDRAM. Our results show that limiting the minimum vertical interlayer distance to 300 nm reduces vertical disturbances in memory-on-memory M3D integration. In addition, for memory-on-logic M3D integration, we observe that IWO eDRAM's read bitline is sensitive to crosstalk from high-speed switching logic circuits. We also show that IWO eDRAM with standalone 2D integration is 30X more resilient to Rowhammer than current state-of-the-art memory because the IWO transistor's ON/OFF current ratio is roughly three orders of magnitude greater than that of standard memory access transistors.
11:10 CEST TS22.3 EXPLORING LARGE INTEGER MULTIPLICATION FOR CRYPTOGRAPHY TARGETING IN-MEMORY COMPUTING
Speaker:
Florian Krieger, TU Graz, AT
Authors:
Florian Krieger, Florian Hirner and Sujoy Sinha Roy, TU Graz, AT
Abstract
Emerging cryptographic systems such as Fully Homomorphic Encryption (FHE) and Zero-Knowledge Proofs (ZKP) are computation- and data-intensive. FHE and ZKP implementations in software and hardware largely rely on the von Neumann architecture, where a significant amount of energy is lost on data movement. A promising computing paradigm is computing in memory (CIM), which enables computations to occur directly within memory, thereby reducing data movement and energy consumption. However, efficiently performing large integer multiplications – critical in FHE and ZKP – is an open question, as existing CIM methods are limited to small operand sizes. In this work, we address this question by exploring advanced algorithmic approaches for large integer multiplication, identifying the Karatsuba algorithm as the most effective for CIM applications. Thereafter, we design the first Karatsuba multiplier for resistive CIM crossbars. Our multiplier uses a three-stage pipeline to enhance throughput and, additionally, balances memory endurance with efficient array sizes. Compared to existing CIM multiplication methods, when scaled up to the bit widths required in ZKP and FHE, our design achieves up to 916x higher throughput and a 281x better area-time product.
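The recursion the multiplier builds on is standard Karatsuba: one wide multiplication becomes three half-width multiplications plus shifts and additions, which is what makes FHE/ZKP-sized operands tractable on small CIM tiles. The sketch below shows only this recursion; the crossbar mapping, pipelining, and endurance balancing of the paper are not represented, and the 64-bit base case is an arbitrary assumption.

# Plain Karatsuba recursion (illustration only; no CIM mapping shown).

def karatsuba(x: int, y: int, base_bits: int = 64) -> int:
    if x < (1 << base_bits) or y < (1 << base_bits):
        return x * y                                  # small enough for a single tile
    half = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> half, x & ((1 << half) - 1)
    yh, yl = y >> half, y & ((1 << half) - 1)
    hi = karatsuba(xh, yh, base_bits)
    lo = karatsuba(xl, yl, base_bits)
    mid = karatsuba(xh + xl, yh + yl, base_bits) - hi - lo   # the single extra product
    return (hi << (2 * half)) + (mid << half) + lo

a = 0x123456789ABCDEF0123456789ABCDEF0
b = 0x0FEDCBA9876543210FEDCBA987654321
assert karatsuba(a, b) == a * b
print(hex(karatsuba(a, b)))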
11:15 CEST TS22.4 A LOW-COMPLEXITY TRUE RANDOM NUMBER GENERATION SCHEME USING 3D-NAND FLASH MEMORY
Speaker:
Ruibin Zhou, Sun Yat-sen University, CN
Authors:
Ruibin Zhou1, Jian Huang1, Xianping Liu2, Yuhan Wang1, Xinrui Zhang1, Yungen Peng1 and Zhiyi Yu1
1Sun Yat-Sen University, CN; 2Sun Yat-Sen University and Peng Cheng Laboratory, CN
Abstract
Unpredictable true random numbers are essential in cryptographic applications and secure communications. However, implementing True Random Number Generators (TRNGs) typically requires specialized hardware devices. In this paper, we propose a low-complexity true random number extraction scheme that can be implemented in endpoint systems containing 3D-NAND flash memory chips, addressing the need for random numbers without requiring additional complex hardware. We successfully utilized the randomness of the rapid charging and discharging of shallow charge traps in 3D-NAND memory as an entropy source. The proposed approach only requires conventional user-mode erase, program, and read operations, without any special timing control. We successfully extracted random bitstream using this scheme without a post-debiasing process. We evaluated the randomness of the generated bitstream using the NIST SP 800-22 statistical test suite, and it passed all 15 tests.
11:20 CEST TS22.5 A SYNTHESIZABLE THYRISTOR-LIKE LEAKAGE-BASED TRUE RANDOM NUMBER GENERATOR
Speaker:
Seohyun Kim, Ajou University, KR
Authors:
Seo Hyun Kim, Jang Hyun Kim and Jongmin Lee, Ajou University, KR
Abstract
As the demand for random data in cryptographic systems continues to rise, the importance of True Random Number Generators (TRNGs) becomes increasingly crucial for securing cryptographic applications. However, designing a TRNG that is reliable, secure, and cost-effective presents a significant challenge in hardware security. In this paper, we propose a synthesizable TRNG design based on a thyristor-like leakage-based (TL) structure, optimized for secure applications with small area and cost-efficiency. Our design has been validated using a 65-nm CMOS process, achieving a throughput of 0.397 Mbps within a compact area of 14.4 μm², offering considerable cost savings while maintaining high randomness and an area-throughput trade-off of 27.57 Gbps/mm². Moreover, this TRNG can be synthesized as a standard cell through a semi-custom design flow, significantly reducing design costs and enabling design automation, which streamlines the process and reduces the time and effort required compared to traditional full-custom TRNGs. Additionally, as it is library characterized, the number of TL TRNG cells can be freely adjusted to meet specific application requirements, offering flexibility in both performance and scalability. To assess its randomness, the NIST statistical test suite was applied, and the proposed TL TRNG successfully passed all applicable tests, demonstrating its randomness.
11:25 CEST TS22.6 GRAFTED TREES BEAR BETTER FRUIT: AN IMPROVED MULTIPLE-VALUED PLAINTEXT-CHECKING SIDE-CHANNEL ATTACK AGAINST KYBER
Speaker:
Jinnuo Li, School of Computer Science, China University of Geosciences, Wuhan, China, CN
Authors:
Jinnuo Li1, Chi Cheng1, Muyan Shen2, Peng Chen1, Qian Guo3, Dongsheng Liu4, Liji Wu5 and Jian Weng6
1China University of Geosciences, Wuhan, CN; 2School of Cryptology, University of Chinese Academy of Sciences, Beijing, China, CN; 3Lund University, Lund, Sweden, SE; 4School of Integrated Circuits, Huazhong University of Science and Technology, CN; 5School of Integrated Circuits, Tsinghua University, Beijing, China, CN; 6College of Cyber Security, Jinan University, Guangzhou, China, CN
Abstract
As a prominent category of side-channel attacks (SCAs), plaintext-checking (PC) oracle-based SCAs offer the advantages of generality and operational simplicity on a targeted device. At TCHES 2023, Rajendran et al. and Tanaka et al. independently proposed the multiple-valued (MV) PC oracle, significantly reducing the required number of queries (a.k.a., traces) in the PC oracle. However, in practice, when dealing with environmental noise or inaccuracies in the waveform classifier, they still rely on majority voting or the other technique that usually results in three times the number of queries compared to the ideal case. In this paper, we propose an improved method to further reduce the number of queries of the MV-PC oracle, particularly in scenarios where the oracle is imperfect. Compared to the state-of-the-art at TCHES 2023, our proposed method reduces the number of queries for a full key recovery by more than 42.5%. The method involves three rounds. Our key observation is that coefficients recovered in the first round can be regarded as prior information to significantly aid in retrieving coefficients in the second round. This improvement is achieved through a newly designed grafted tree. Notably, the proposed method is generic and can be applied to both the NIST key encapsulation mechanism (KEM) standard Kyber and other significant candidates, such as Saber and Frodo. We have conducted extensive software simulations against Kyber-512, Kyber-768, Kyber-1024, FireSaber, and Frodo-1344 to validate the efficiency of the proposed method. An electromagnetic attack conducted on real-world implementations, using an STM32F407G board equipped with an ARM Cortex-M4 microcontroller and Kyber implementation from the public library pqm4, aligns well with our simulations.
11:30 CEST TS22.7 CAS-PUF: CURRENT-MODE ARRAY-TYPE STRONG PUF FOR SECURE COMPUTING IN AREA CONSTRAINED SOCS
Speaker:
Dimosthenis Georgoulas, University of Ioannina, GR
Authors:
Dimosthenis Georgoulas, Yiorgos Tsiatouhas and Vasileios Tenentes, University of Ioannina, GR
Abstract
Secure computing necessitates the integration in Systems-on-Chips (SoCs) of strong Physical Unclonable Functions (PUFs) that can generate a vast number of Challenge-Response Pairs (CRPs) for cryptographic key generation, identification, and authentication. However, the excessive area cost of strong PUF designs imposes integration difficulties on SoCs for area-constrained applications, such as the IoT and mobile computing. In this paper, we present a novel strong PUF design with silicon area requirements significantly lower than those of previous strong PUFs. The proposed Current-mode Array-type Strong PUF (CAS-PUF) is based on a current source topology of only six minimum-size transistors, which is tolerant to power supply variation for enhanced reliability. Compared to previous strong PUFs, the CAS-PUF achieves the same number of CRPs with 20% to 72% less area; for the same area, it provides a 19 to 53 orders of magnitude higher number of CRPs. Furthermore, extensive Monte Carlo simulations on the CAS-PUF show a reliability of 96.45% under ±10% power supply fluctuation and 97.69% under temperature variation (0°C to 80°C), with an average uniqueness and uniformity of 50.01% and 49.54%, respectively. Therefore, the CAS-PUF can be used as a hardware root-of-trust mechanism to secure computing in area-constrained SoCs.
11:35 CEST TS22.8 FLASH: AN EFFICIENT HARDWARE ACCELERATOR LEVERAGING APPROXIMATE AND SPARSE FFT FOR HOMOMORPHIC ENCRYPTION
Speaker:
Tengyu Zhang, Peking University, CN
Authors:
Tengyu Zhang1, Yufei Xue2, Ling Liang1, Zhen Gu3, Yuan Wang1, Runsheng Wang1, Ru Huang1 and Meng Li1
1Peking University, CN; 2Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, HK; 3Alibaba Group, CN
Abstract
Private convolutional neural network (CNN) inference based on hybrid homomorphic encryption (HE) and two-party computation (2PC) is emerging as a promising technique for protecting sensitive user data. However, homomorphic convolutions (HConvs) suffer from high computation costs due to the extensive number theoretic transforms (NTTs). While customized accelerators have been proposed, they usually overlook the intrinsic error resilience and native sparsity of DNNs and hybrid HE/2PC protocols. In this paper, we propose FLASH, which leverages these key characteristics for highly efficient HConv. Specifically, we observe that private DNN inference is robust to computation errors and propose approximate fast Fourier transforms (FFTs) to replace NTTs and avoid expensive modular reduction operations. We also design a flexible sparse FFT dataflow leveraging the high sparsity of weight plaintexts. With extensive experiments, we demonstrate that FLASH improves power efficiency by 90.7x for weight transforms and by 9.7x for all transforms in HConvs compared to existing works. For the HConvs in ResNet-18 and ResNet-50, FLASH achieves about 87.3% energy consumption reduction.
11:40 CEST TS22.9 HFL: HARDWARE FUZZING LOOP WITH REINFORCEMENT LEARNING
Speaker:
Lichao Wu, TU Darmstadt, DE
Authors:
Lichao Wu, Mohamadreza Rostami, Huimin Li and Ahmad-Reza Sadeghi, TU Darmstadt, DE
Abstract
As hardware systems grow increasingly complex, ensuring their security becomes more critical. This complexity often introduces difficult and costly vulnerabilities to address after fabrication. Traditional verification methods, such as formal and dynamic approaches, encounter limitations in scalability and efficiency when applied to complex hardware designs. While hardware fuzzing presents a promising solution for efficient and effective vulnerability detection, current methods face several challenges, including coverage saturation, long simulation times, and limited vulnerability detection capabilities. This paper introduces Hardware Fuzzing Loop (HFL), a novel fuzzing framework designed to address these limitations. We demonstrate that Long Short-Term Memory (LSTM), a machine learning model commonly used in natural language processing, can effectively capture the semantics of test cases and accurately predict hardware coverage. Building on this insight, we leverage reinforcement learning to optimize the test generation strategy dynamically within a hardware fuzzing loop. Our approach utilizes a multi-head LSTM to generate sophisticated RISC-V assembly instruction sequences, along with an LSTM-based predictor that evaluates the quality of these instructions. By dynamically interacting with the hardware, HFL efficiently explores complex instruction sequences with minimal fuzzing iterations, allowing it to uncover hard-to-detect vulnerabilities. We evaluated HFL on three RISC-V cores, and the results show that it achieves higher coverage using fewer than 1% of the test cases required by leading hardware fuzzers, effectively mitigating the issue of coverage saturation. Furthermore, HFL identified all known vulnerabilities in the tested systems and discovered four previously unknown high-severity issues, demonstrating its significant potential in improving hardware security assessments.
11:45 CEST TS22.10 REAP-NVM: RESILIENT ENDURANCE-AWARE NVM-BASED PUF AGAINST LEARNING-BASED ATTACKS
Speaker:
Hassan Nassar, Karlsruhe Institute of Technology, DE
Authors:
Hassan Nassar1, Ming-Liang Wei2, Chia-Lin Yang2, Joerg Henkel1 and Kuan-Hsun Chen3
1Karlsruhe Institute of Technology, DE; 2National Taiwan University, TW; 3University of Twente, NL
Abstract
NVM-based PUFs offer secure authentication and cryptographic applications by exploiting NVMs' multi-level cell (MLC) capability to generate diverse, ML-attack-resistant responses. Yet, frequent writes degrade these PUFs, lowering reliability and lifespan. This paper presents a model to assess endurance effects on NVM PUFs, guiding the creation of more robust PUFs. Our novel NVM PUF design enhances endurance by evenly distributing writes, thus mitigating cell stress and achieving a 62x improvement over current solutions while preserving security against learning-based attacks.
11:46 CEST TS22.11 ACCELERATING OBLIVIOUS TRANSFER WITH A PIPELINED ARCHITECTURE
Speaker:
Xiaolin Li, Institute of Computing Technology, Chinese Academy of Sciences, CN
Authors:
Xiaolin Li1, Wei Yan1, Yong Zhang2, Hongwei Liu1, Qinfen Hao1, Yong Liu2 and Ninghui Sun1
1Institute of Computing Technology, Chinese Academy of Sciences, CN; 2Zhongguancun Laboratory, CN
Abstract
With the rapid development of machine learning and big data technologies, ensuring user privacy has become a pressing challenge. Secure multi-party computation offers a solution to this challenge by enabling privacy-preserving computations, but it also incurs significant performance overhead, thus limiting its further application. Our analysis reveals that the oblivious transfer protocol accounts for up to 96.64% of execution time. To address these challenges, we propose POTA, a high-performance pipelined OT hardware acceleration architecture supporting the silent OT protocol. Finally, we implement a POTA prototype on Xilinx VCU129 FPGAs. Experimental results demonstrate that under various network settings, POTA achieves significant speedups, with maximum improvements of 22.67× for OT efficiency and 192.57× for basic operations in MPC applications.

TS23 Reconfigurable systems

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 11:00 CEST - 12:30 CEST

Time Label Presentation Title
Authors
11:00 CEST TS23.1 FPGA-BASED ACCELERATION OF MCMC ALGORITHM THROUGH SELF-SHRINKING FOR BIG DATA
Speaker:
Shuanglong Liu, Hunan Normal University, CN
Authors:
Shuanglong Liu, Shiyu Peng and Wan Shen, Hunan Normal University, CN
Abstract
Markov chain Monte Carlo (MCMC) algorithms are widely used in Bayesian inference to compute the posterior distribution of complex models, facilitating sampling from probability distributions. However, the computational burden of evaluating the likelihood function in MCMC poses significant challenges in big data applications. To address this, sub-sampling methods have been introduced to approximate the target distribution by using subsets of the data rather than the entire dataset. Unfortunately, these methods often lead to biased samples, making them impractical for real-world applications. This paper proposes a novel scaling MCMC method that achieves exact sampling by utilizing a subset (mini-batch) of the data with locally bounded approximations of the target distribution. Our method adaptively adjusts the mini-batch size by automatically tuning a hyperparameter based on the sample acceptance ratio, ensuring optimal balance between sample efficiency and computational cost. Moreover, we introduce a highly optimized hardware architecture to efficiently implement the proposed MCMC method onto FPGA. Our accelerator is evaluated on an AMD Zynq UltraScale+ FPGA device using a Bayesian logistic regression model on the MNIST dataset. The results demonstrate that our design achieves unbiased sampling with a 47.6 times speedup over the standard MCMC design, while also significantly reducing estimation errors compared to state-of-the-art MCMC methods.
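A minimal, deliberately naive sketch of the control knob the paper tunes, not its exact-sampling scheme: a subsampled Metropolis-Hastings step whose mini-batch size is adapted from the running acceptance ratio. The Gaussian model, the adaptation rule, and all constants are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    data = rng.normal(2.0, 1.0, 100_000)           # large dataset with unknown mean
    theta, step, batch = 0.0, 0.05, 512
    accepts, target_acc = 0, 0.4

    def loglik(mu, sample):
        return -0.5 * np.sum((sample - mu) ** 2)   # Gaussian likelihood, unit variance

    for it in range(1, 2001):
        idx = rng.integers(0, data.size, batch)
        sample = data[idx]
        scale = data.size / batch                  # rescale the subsampled likelihood
        prop = theta + rng.normal(0.0, step)
        log_alpha = scale * (loglik(prop, sample) - loglik(theta, sample))
        if np.log(rng.random()) < log_alpha:
            theta, accepts = prop, accepts + 1
        if it % 200 == 0:                          # adapt the mini-batch size
            acc = accepts / 200
            # grow the batch when acceptance drifts from the target (noisy estimate),
            # shrink it again when acceptance is back on target
            if abs(acc - target_acc) > 0.1:
                batch = min(data.size, batch * 2)
            else:
                batch = max(64, batch // 2)
            accepts = 0

    print("posterior mean estimate:", round(theta, 3), "final mini-batch:", batch)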
11:05 CEST TS23.2 ATE-GCN: AN FPGA-BASED GRAPH CONVOLUTIONAL NETWORK ACCELERATOR WITH ASYMMETRICAL TERNARY QUANTIZATION
Speaker:
Ruiqi Chen, Vrije Universiteit Brussel, BE
Authors:
Ruiqi Chen1, Jiayu Liu2, Shidi Tang3, Yang Liu4, Yanxiang Zhu5, Ming Ling3 and Bruno da Silva1
1Vrije Universiteit Brussel, BE; 2University College London, GB; 3Southeast University, CN; 4Fudan University, CN; 5VeriMake Innovation Laboratory, CN
Abstract
Ternary quantization can effectively simplify matrix multiplication, which is the primary computational operation in neural network models. It has shown success in FPGA-based accelerator designs for emerging models such as GAT and Transformer. However, existing ternary quantization methods can lead to substantial accuracy loss under certain weight distribution patterns, such as GCN. Furthermore, current FPGA-based ternary weight designs often focus on reducing resource consumption while neglecting full utilization of FPGA DSP blocks, limiting maximum performance. To address these challenges, we propose ATE-GCN, an FPGA-based asymmetrical ternary quantization GCN accelerator using a software-hardware co-optimization approach. First, we adopt an asymmetrical quantization strategy with specific interval divisions tailored to the bimodal distribution of GCN weights, reducing accuracy loss. Second, we design a unified processing element (PE) array on FPGA to support various matrix computation forms, optimizing FPGA resource usage while leveraging the benefits of cascade design and ternary quantization, significantly boosting performance. Finally, we implement the ATE-GCN prototype on the VCU118 FPGA board. The results show that ATE-GCN maintains an accuracy loss below 2%. Additionally, ATE-GCN achieves average performance improvements of 224.13× and 11.1×, with up to 898.82× and 69.9× energy consumption saving compared to CPU and GPU, respectively. Moreover, compared to state-of-the-art FPGA-based GCN accelerators, ATE-GCN improves DSP efficiency by 63% with an average latency reduction of 11%.
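A minimal sketch of asymmetric ternary quantization, with illustrative thresholds rather than ATE-GCN's calibrated interval divisions: different negative and positive cut points map a bimodal weight distribution to {-1, 0, +1}.

    import numpy as np

    def ternary_asymmetric(w, t_neg, t_pos):
        # asymmetric cut points fit a bimodal weight distribution better than
        # a single symmetric threshold
        q = np.zeros_like(w, dtype=np.int8)
        q[w <= t_neg] = -1
        q[w >= t_pos] = +1
        return q

    rng = np.random.default_rng(0)
    # toy bimodal weights, loosely mimicking the GCN pattern the abstract mentions
    w = np.concatenate([rng.normal(-0.6, 0.1, 500), rng.normal(0.2, 0.05, 500)])
    q = ternary_asymmetric(w, t_neg=-0.3, t_pos=0.1)
    print("fraction of -1/0/+1:", [(q == v).mean() for v in (-1, 0, 1)])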
11:10 CEST TS23.3 PREVV: ELIMINATING STORE QUEUE VIA PREMATURE VALUE VALIDATION FOR DATAFLOW CIRCUIT ON FPGA
Speaker:
Kuangjie Zou, Fudan University, CN
Authors:
Kuangjie Zou, Yifan Zhang, Zicheng Zhang, Guoyu Li, Jianli Chen, Kun Wang and Jun Yu, Fudan University, CN
Abstract
Dynamic scheduling in high-level synthesis (HLS) maximizes pipeline performance by enabling out-of-order scheduling of load and store requests at runtime. However, this method introduces unpredictable memory dependencies, leading to data disambiguation challenges. Load-store queues (LSQs), commonly used in superscalar CPUs, offer a potential solution for HLS. However, LSQs in dynamically scheduled HLS implementations often suffer from high resource overhead and scalability limitations. In this paper, we introduce PreVV, an architecture based on premature value validation designed to address memory disambiguation with minimal resource overhead. Our approach substitutes LSQ with several PreVV components and a straightforward premature queue. We prevent potential deadlocks by incorporating a specific tag that can send 'fake' tokens to prevent the accumulation of outdated data. Furthermore, we demonstrate that our design has scalability potential. We implement our design using several hardware templates and an LLVM pass to generate targeted dataflow circuits with PreVV. Experimental results on various benchmarks with data hazards show that, compared to state-of-the-art dynamic HLS, PreVV16 (a version with a premature queue depth of 16) reduces LUT usage by 43.91% and FF usage by 33.09%, with minimal impact on timing performance. Meanwhile, PreVV64 (a version with a premature queue depth of 64) reduces LUT usage by 27.21% and FF usage by 33.10%, without affecting timing performance.
11:15 CEST TS23.4 PEARL: FPGA-BASED REINFORCEMENT LEARNING ACCELERATION WITH PIPELINED PARALLEL ENVIRONMENTS
Speaker:
Jiayi Li, Peking University, CN
Authors:
Jiayi Li, Hongxiao Zhao, Wenshuo Yue, Yihan Fu, Daijing Shi, Anjunyi Fan, Yuchao Yang and Bonan Yan, Peking University, CN
Abstract
Reinforcement learning (RL) is an effective machine learning approach that enables artificial intelligence agents to perform complex tasks and make decisions in dynamic situations. Training an RL agent demands its repetitive interaction with the environment to learn optimal policies. To efficiently collect training data, parallelizing environments is a widely used technique by enabling simultaneous interactions between multiple agents and environments. However, existing CPU-based RL software frameworks face a key challenge of slow multi-environmental update computation. To solve this problem, we present a novel FPGA-based RL accelerating framework--PEARL. PEARL instantiates multiple parallel environments and accelerates them with a carefully designed pipeline scheme to hide data transfer latency within the computation time. We evaluate PEARL on representative RL environments and achieve 4.36× to 972.6× speedup over the existing fastest software-based framework for parallel environment execution. When scaling the number of environments from 1024 to 43008 (42×) in CliffWalking benchmark, the power consumption increases marginally by 3%, while LUT and flip-flops utilization rise by 2.24× and 3.08×, respectively. This demonstrates efficient resource usage and power management in PEARL. Further, PEARL allows users to define and add their environments within the framework flexibly. We have established an open-source repository for users to utilize and expand. We also implement PEARL with the existing RL algorithm and achieve acceleration. It is available online at https://github.com/Selinaee/FPGA_Gym.
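A minimal software sketch of the parallel-environment idea, not PEARL's FPGA pipeline: many independent toy grid environments are stepped in one vectorized update, with automatic reset of finished instances. The environment dynamics and sizes are assumptions.

    import numpy as np

    class VecGridEnv:
        def __init__(self, n_envs, size=12):
            self.size = size
            self.pos = np.zeros(n_envs, dtype=np.int64)

        def step(self, actions):                 # actions in {-1, +1}
            self.pos = np.clip(self.pos + actions, 0, self.size - 1)
            done = self.pos == self.size - 1
            reward = np.where(done, 1.0, -0.01)
            self.pos[done] = 0                   # auto-reset finished environments
            return self.pos.copy(), reward, done

    env = VecGridEnv(n_envs=1024)
    rng = np.random.default_rng(0)
    obs, rew, done = env.step(rng.choice([-1, 1], size=1024))
    print(obs.shape, float(rew.mean()), int(done.sum()))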
11:20 CEST TS23.5 AISPGEMM: ACCELERATING IMBALANCED SPGEMM ON FPGAS WITH FLEXIBLE INTERCONNECT AND INTRA-ROW PARALLEL MERGING
Speaker:
Yuanfang Wang, Fudan University, CN
Authors:
Enhao Tang1, Shun Li2, Hao Zhou3, Guohao Dai3, Jun Lin4 and Kun Wang1
1Fudan University, CN; 2Southeast University, CN; 3Shanghai Jiao Tong University, CN; 4Nanjing University, CN
Abstract
The row-wise product algorithm shows significant potential for sparse matrix-matrix multiplication (SpGEMM) on hardware accelerators. Recent studies have made notable progress in accelerating SpGEMM using this algorithm. However, several challenges remain in accelerating imbalanced SpGEMM, where the distribution of non-zero elements across different rows is imbalanced. These challenges include: (1) the fixed dataflow of the merger tree, which leads to lower PE utilization, and (2) highly imbalanced data distributions, such as single rows with numerous non-zero elements, which result in intensive computations. This imbalance significantly challenges SpGEMM acceleration, leading to time-consuming processes that dominate overall computation time. In this paper, we propose AiSpGEMM to accelerate imbalanced SpGEMM on FPGAs. First, we improved the C2SR format to adapt it for imbalanced SpGEMM acceleration based on the row-wise product algorithm. This reduces off-chip memory bank conflicts and increases data reuse of matrix B. Secondly, we design a reconfigurable merger (R-merger) with flexible interconnects to improve PE utilization. Additionally, we propose an intra-row parallel merging algorithm and its corresponding hardware architecture, the parallel merger (P-merger), to accelerate intensive operations. Experimental results demonstrate that AiSpGEMM achieves a geometric mean (geomean) speedup of 5.8× compared to the state-of-the-art FPGA-based SpGEMM accelerator. In Geomean, AiSpGEMM achieves a 3.0× speedup and a 9.8× improvement in energy efficiency compared to the NVIDIA cuSPARSE library running on an NVIDIA A6000 GPU. Moreover, AiSpGEMM-21 demonstrated a 4× increase in average throughput compared to the same GPU.
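A minimal sketch of the row-wise product formulation that AiSpGEMM accelerates: each nonzero A[i,k] scales row k of B, and the partial rows are merged into output row i, which is the step the R-merger and P-merger parallelize in hardware. The CSR-like dict-of-rows representation is an illustrative assumption.

    def spgemm_rowwise(A_rows, B_rows):
        # A_rows / B_rows: one {column: value} dict per row (CSR-like)
        C_rows = []
        for a_row in A_rows:
            acc = {}
            for k, a_val in a_row.items():       # nonzeros of row i of A
                for j, b_val in B_rows[k].items():
                    acc[j] = acc.get(j, 0.0) + a_val * b_val   # intra-row merge
            C_rows.append(acc)
        return C_rows

    A = [{0: 2.0, 2: 1.0}, {1: 3.0}]
    B = [{0: 1.0}, {2: 4.0}, {0: 5.0, 1: 1.0}]
    print(spgemm_rowwise(A, B))   # [{0: 7.0, 1: 1.0}, {2: 12.0}]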
11:25 CEST TS23.6 FAMERS: AN FPGA ACCELERATOR FOR MEMORY-EFFICIENT EDGE-RENDERED 3D GAUSSIAN SPLATTING
Speaker:
Yuanfang Wang, Fudan University, CN
Authors:
Yuanfang Wang, Yu Li, Jianli Chen, Jun Yu and Kun Wang, Fudan University, CN
Abstract
This paper introduces FAMERS, a tile-based hardware accelerator designed for efficient 3D Gaussian Splatting (3DGS) inference on edge-deployed Field Programmable Gate Arrays (FPGAs). 3DGS has emerged as a powerful technique for photorealistic image rendering, leveraging anisotropic Gaussians to balance computational efficiency and visual fidelity. However, the high memory and processing demands of 3DGS pose significant challenges for real-time applications on resource-constrained edge devices. To address these limitations, we present a novel architecture that optimizes both computational and memory overheads through model pruning and compression techniques, enabling high-quality rendering within the constrained memory and processing capabilities of edge platforms. Experimental results demonstrate that our implementation on the Xilinx XC7K325T FPGA achieves a 1.99× speedup and 13.46× energy efficiency compared to NVIDIA RTX 3060M Laptop GPU, underscoring the viability of our approach for real-time applications in virtual and augmented reality.
11:30 CEST TS23.7 SMARTMAP: ARCHITECTURE-AGNOSTIC CGRA MAPPING USING GRAPH TRAVERSAL AND REINFORCEMENT LEARNING
Speaker:
Ricardo Ferreira, Federal University of Viçosa, BR
Authors:
Fábio Ramos1, Pedro Realino1, Wagner Junior1, Alex Vieira2, Ricardo Ferreira1 and José Nacif1
1Federal University of Viçosa, BR; 2Federal University of Juiz de Fora, BR
Abstract
Coarse-Grained Reconfigurable Architectures (CGRAs) have been the subject of extensive research due to their balance between performance, energy efficiency, and flexibility. CGRAs must be capable of executing a dataflow graph (DFG), which depends on a compiler producing quality valid mappings with feasible running time performance and portable mapping DFGs on different CGRA architectures. Machine learning-based compilers have shown promising results by presenting high quality and performance but offer limited portability. Moreover, some approaches do not explore efficient placement methods or do not demonstrate whether they scale to more challenging, less connected architectures. This paper presents SmartMap, an architecture-agnostic framework that uses an actor-critic reinforcement learning method applied to a Monte-Carlo Tree Search (MCTS) to learn how to map a DFG onto a CGRA. This framework offers full portability using a state-action representation layer in the policy network instead of a probability distribution over actions. SmartMap uses a graph traversal placement method to provide scalability and improve the efficiency of MCTS by enabling more efficient exploration during the search. Our results show that SmartMap has 2.81x more mapping capacity, a 16.82x speed-up in compilation time, and consumes fewer resources compared to the state-of-the-art.
11:35 CEST TS23.8 DATAFLOW OPTIMIZED RECONFIGURABLE ACCELERATION FOR FEM-BASED CFD SIMULATIONS
Speaker:
Aggelos Ferikoglou, National TU Athens, GR
Authors:
Anastassis Kapetanakis, Aggelos Ferikoglou, Georgios Anagnostopoulos and Sotirios Xydis, National TU Athens, GR
Abstract
Computational Fluid Dynamics (CFD) simulations are essential for analyzing and optimizing fluid flows in a wide range of real-world applications. These simulations involve approximating the solutions of the Navier-Stokes differential equations using numerical methods, which are highly compute- and memory-intensive due to their need for high-precision iterations. In this work, we introduce a high-performance FPGA accelerator specifically designed for numerically solving the Navier-Stokes equations. We focus on the Finite Element Method (FEM) due to its ability to accurately model complex geometries and intricate setups typical of real-world applications. Our accelerator is implemented using High-Level Synthesis (HLS) on an AMD Alveo U200 FPGA, leveraging the reconfigurability of FPGAs to offer a flexible and adaptable solution. The proposed solution achieves 7.9x higher performance than optimized Vitis-HLS implementations and 45% lower latency with 3.64x less power compared to a software implementation on a high-end server CPU. This highlights the potential of our approach to solve Navier-Stokes equations more effectively, paving the way for tackling even more challenging CFD simulations in the future.
11:40 CEST TS23.9 A RESOURCE-AWARE RESIDUAL-BASED GAUSSIAN BELIEF PROPAGATION ACCELERATOR TOOLFLOW
Speaker:
Omar Sharif, Imperial College London, GB
Authors:
Omar Sharif and Christos Bouganis, Imperial College London, GB
Abstract
Gaussian Belief Propagation (GBP) is a graphical method of statistical inference that provides an approximate solution to the probability distribution of a system. In recent years, GBP has emerged as a powerful computational framework with numerous applications in domains such as SLAM and image processing. In pursuit of high performance efficiency (i.e., inference per watt), streaming-based reconfigurable hardware solutions have demonstrated significant performance gains compared to leading-edge processors and high-power, server-grade CPUs. However, this class of architectures suffers from performance degradation at scale when on-chip memory is limited. This paper addresses this challenge by building on previous GBP architectural and algorithmic developments, introducing a novel hardware method that dynamically prioritizes node computations by monitoring information gain. By leveraging the inherent properties of the GBP algorithm, we demonstrate how convergence-driven optimizations can push the performance envelope of state-of-the-art reconfigurable accelerators despite on-chip memory constraints. The performance of our architecture is rigorously evaluated against these prior accelerators across both real-world and synthetic SLAM and image-denoising benchmarks. For equal resources, our work achieves a convergence rate improvement of up to 3.5x for large graphs, demonstrating its effectiveness for real-time inference tasks.
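A minimal sketch of residual-driven scheduling, the software analogue of the information-gain-based node prioritization the abstract describes (the actual GBP message math is omitted): the node whose state would still change the most is popped from a priority queue first. The toy chain relaxation below is an assumption.

    import heapq

    # toy chain "graph": each node relaxes toward the mean of its neighbours;
    # the residual is how far it still is from that mean
    values = [1.0, 5.0, -3.0, 10.0]

    def neighbours(i):
        return [j for j in (i - 1, i + 1) if 0 <= j < len(values)]

    def residual(i):
        nbrs = [values[j] for j in neighbours(i)]
        return abs(sum(nbrs) / len(nbrs) - values[i])

    def update(i):
        nbrs = [values[j] for j in neighbours(i)]
        values[i] = sum(nbrs) / len(nbrs)

    heap = [(-residual(i), i) for i in range(len(values))]
    heapq.heapify(heap)
    for _ in range(200):
        neg_r, i = heapq.heappop(heap)
        if -neg_r < 1e-6:
            break                            # largest remaining residual is negligible
        update(i)
        for j in [i] + neighbours(i):        # refresh priorities of the touched region
            heapq.heappush(heap, (-residual(j), j))
    print([round(v, 3) for v in values])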
11:45 CEST TS23.10 UNIT: A HIGHLY UNIFIED AND MEMORY-EFFICIENT FPGA-BASED ACCELERATOR FOR TORUS FHE
Speaker:
Yuying ZHANG, The Hong Kong University of Science and Technology, HK
Authors:
Yuying ZHANG1, Sharad Sinha2, Jiang Xu3 and Wei Zhang1
1The Hong Kong University of Science and Technology, HK; 2Indian Institute of Technology (IIT) Goa, IN; 3The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Fully Homomorphic Encryption (FHE) has emerged as a promising solution for the secure computation on encrypted data without leaking user privacy. Among various FHE schemes, Torus FHE (TFHE) distinguishes itself by its ability to perform exact computations on non-linear functions within the encrypted domain, satisfying the crucial requirement for privacy-preserving AI applications. However, the high computational overhead and strong data dependency in TFHE's bootstrapping process present significant challenges to its practical adoption and efficient hardware implementation. Existing TFHE accelerators on various hardware platforms still face limitations in terms of performance, flexibility, and area efficiency. In this work, we propose UNIT, a novel and highly unified accelerator for Programmable Bootstrapping (PBS) in TFHE, featuring carefully designed computation units. We introduce a unified architecture for negacyclic (inverse) number theoretic transform (I)NTT with fused twisting steps, which reduces computing resources by 33% and the memory utilization of pre-stored factors by nearly 66%. Another key feature of UNIT is the innovative design of the monomial number theoretic transform unit, called OF-MNTT, which leverages on-the-fly twiddle factor generation to eliminate memory traffic and overhead. This memory-efficient and highly parallelizable approach for MNTT is proposed for the first time in TFHE acceleration. Furthermore, UNIT is highly reconfigurable and scalable, supporting various parameter sets and performance-resource requirements. Our proposed accelerator is evaluated on the Xilinx Alveo U250 FPGA platform. Experimental results demonstrate its superior performance compared to the state-of-the-art GPU and FPGA-based implementations with improvements of 8.3x and 3.63x, respectively. In comparison with the most advanced FPGA implementation, UNIT achieves 30% enhanced area efficiency and 3.2x reduced power with much better flexibility.
11:50 CEST TS23.11 RGHT-Q: RECONFIGURABLE GEMM UNIT FOR HETEROGENEOUS-HOMOGENEOUS TENSOR QUANTIZATION
Speaker:
Seungho Lee, Sungkyunkwan University, KR
Authors:
Seungho Lee, Donghyun Nam and Jeongwoo Park, Sungkyunkwan University, KR
Abstract
The high computational demands of large language models (LLMs) are limited by the lack of GPU hardware support for heterogeneous quantization, which mixes integers and floating points. To address this limitation, we propose an LLM processing element (PE), RGHT-Q, which features reconfigurable general-matrix multiplication (GEMM) for both heterogeneous and homogeneous tensor quantization. The RGHT-Q introduces a novel design that leverages butterfly routing and multi-precision multipliers. As a result, we achieve significant performance improvements, offering 3.14× higher energy efficiency, and 1.56× better area efficiency compared to prior designs.

LK03 Special Day Emerging Computing Paradigms Lunchtime Keynote

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 13:15 CEST - 14:00 CEST

Time Label Presentation Title
Authors
13:15 CEST LK03.1 SPECIAL DAY EMERGING COMPUTING PARADIGM LUNCHTIME KEYNOTE
Presenter:
Christian Mayr, TU Dresden, DE
Author:
Christian Mayr, TU Dresden, DE
Abstract
.

FS04 Focus Session - Designing Secure Space Systems

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST

Session chair:
Sebastian Steinhorst, TU Munich, DE

Session co-chair:
Daniel Lüdtke, German Aerospace Center (DLR), DE

Organisers:
Sebastian Steinhorst, TU Munich, DE
Michael Felderer, German Aerospace Center (DLR), DE

As the scope of space exploration expands, the need for robust cybersecurity measures has become more urgent than ever. Nowadays, private companies are entering the space sector, leading to a big increase in satellite launches and space activities. While this expansion reduces launch costs, it also elevates the risk of cyber threats. Historically, cybersecurity in space has been overlooked, leaving critical vulnerabilities exposed. This hot-topic session will bring together four experts from industry, government, and research to tackle the critical challenges and explore innovative solutions in building a secure space ecosystem.

Time Label Presentation Title
Authors
14:00 CEST FS04.1 OFFENSIVE SECURITY TESTING FOR SPACE SYSTEMS
Presenter:
Milenko Starcik, VisionSpace Technologies GmbH, DE
Author:
Milenko Starcik, VisionSpace Technologies GmbH, DE
Abstract
Space missions, especially commercial space systems, are targeted by state-backed Advanced Persistent Threat (APT) actors since they increasingly share capacity between government and private users. The attacks often exploit legacy hardware, software, and outdated protocols. Legacy system vulnerabilities and the effects of the COVID-19 pandemic have further exposed space systems to potential exploitation. Recent incidents, such as the attack on satellite terminals in the 2022 ViaSat case with its widespread impact, show how legacy systems can lead to a security breach. While the space systems community has a strong safety and test engineering history, security validation is often neglected. Our security research on currently used space protocols, mission control software, and spacecraft onboard software frameworks shows that security measures are still not applied throughout the space mission life cycle.
14:22 CEST FS04.2 LOCKING YOUR DOOR DOES NOT MAKE YOU SECURE AT YOUR HOME, SIMILARLY YOUR SATELLITE!
Presenter:
Zain Hammadeh, German Aerospace Center (DLR), DE
Author:
Zain Hammadeh, German Aerospace Center (DLR), DE
Abstract
Securing the link between the ground segment and the satellite is essential to protect the satellite from cyber-attacks. Solutions including end-to-end encryption can help avoid attacks like spoofing and replay attacks. However, developers of on-board software should not assume that a satellite environment is secure, especially in an era where a satellite will serve as an execution service for 3rd party software, which can be malicious. Efficient intrusion detection systems (IDS) are essential for monitoring network traffic and system behavior to identify malicious activities in real-time. Additionally, an effective intrusion response mechanism must be in place to ensure that the satellite can continue functioning even under attack. This requires a fail-operational mode that guarantees essential systems remain operational while isolating and neutralizing compromised components. Given the constraints on computational resources in space systems, these security solutions must be optimized for low-latency response and minimal resource consumption, all while ensuring high reliability and resilience against evolving cyber threats.
14:45 CEST FS04.3 SECURITY ENGINEERING (NOT JUST) FOR SPACE
Presenter:
Stefan Langhamme, OHB Digital Connect GmbH, DE
Author:
Stefan Langhamme, OHB Digital Connect GmbH, DE
Abstract
"s space exploration advances and the commercialization of space technologies grows, the security of space assets has become a critical concern. In a related trend the use of "off the shelf" hard- and software facilitates the commercial use of space, but also creates new attack surfaces. This creates a need for off the shelf solutions for security risks. And while a lot of very good solutions exist, experience shows that adding "security" to a system does not automatically lead to an increase in security. This was just recently demonstrated by the global IT outage caused by the CrowdStrike security software. What is needed is the integration of cybersecurity into the engineering lifecycle. In this talk we will investigate ways in which the diverse field of cybersecurity - ranging from organisational and management questions to deeply technical topics – can be integrated into the engineering lifecycle of space systems. The underlying aim is improving the security stance of the system without adding new problems or unnecessary complexity. Key areas covered include threat modelling, risk assessment, secure software and hardware design, encryption, and response strategies. Our aim is to deepen the listeners understanding of what security is, how to achieve it and how to learn from mistakes made in "non-space" IT systems."
15:07 CEST FS04.4 A JOINT EFFORT: STANDARDIZATION OF CYBERSECURITY IN SPACE
Presenter:
Florian Göhler, Germany's Federal Office for Information Security (BSI), DE
Author:
Florian Göhler, Germany's Federal Office for Information Security (BSI), DE
Abstract
Cybersecurity should be an integrated part of every space mission, and security aspects need to be considered throughout all phases of a project. However, there is a lack of universally applicable security standards that address cyberthreats in space, as existing security standards often miss security measures against space-specific threats. Especially small institutions, start-ups, and research facilities suffer from this lack of guidance, but the issue is also pressing for established industry stakeholders. To overcome this situation, the German Federal Office for Information Security founded an expert group for cybersecurity in space that invites experts from governmental institutions, industry, and academics to work together on standardization and regulation. In a joint effort and based on existing standards, the expert group developed multiple documents that aim to mitigate cyberthreats on space and ground segments. These guidelines focus on every life cycle phase of a space mission, and they are adaptable to the scope and complexity of any given project. Furthermore, the expert group aims to identify emerging new technologies and regulations that may impact cybersecurity in space. These efforts also take international developments into account.

HSD03 HackTheSilicon DATE

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 18:00 CEST


LKS04 Later … with the keynote speakers

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:00 CEST


TS24 Logical analysis and design

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS24.1 HYBRID EXACT AND HEURISTIC EFFICIENT TRANSISTOR NETWORK OPTIMIZATION FOR MULTI-OUTPUT LOGIC
Speaker:
Lang Feng, Sun Yat-sen University, CN
Authors:
Lang Feng1, Rongjian Liang2 and Hongxin Kong3
1Sun Yat-sen University, CN; 2NVIDIA Corp., US; 3Texas A&M University, US
Abstract
With the approaching post-Moore era, it is becoming increasingly impractical to decrease the transistor size in digital VLSI for better performance. To address this issue, one approach is to optimize the digital circuit at the transistor level to reduce the transistor count. Although previous works have explored ways to conduct transistor network optimization, most of these efforts have focused on single-output networks or applied heuristics only, limiting their scope or optimization quality. In this paper, we propose an exact transistor network optimization algorithm that supports multi-output logic and is formulated as a SAT problem. Our approach maintains a high optimization level by employing the exact algorithm, while also incorporating a hybrid process that uses a heuristic algorithm to predict the solution range as a guidance for better efficiency. Experimental results show that the proposed algorithm has a 5.32% better optimization level with 54% less runtime compared with the state-of-the-art work.
14:05 CEST TS24.2 MAXIMUM FANOUT-FREE WINDOW ENUMERATION: TOWARDS MULTI-OUTPUT SUB-STRUCTURE SYNTHESIS
Speaker:
Ruofei TANG, Hong Kong Baptist University, CN
Authors:
Ruofei Tang1, Xuliang Zhu2, Xing Li3, Lei Chen4, Xin Huang1, Mingxuan Yuan4 and Jianliang Xu5
1Hong Kong Baptist University, HK; 2Antai College of Economics and Management, Shanghai Jiaotong University, CN; 3Huawei Noah's Ark Lab, CN; 4Huawei Noah's Ark Lab, HK; 5Hong Kong Baptist University, HK
Abstract
Peephole optimization is commonly used in And-Inverter Graphs (AIGs) optimization algorithms. The efficiency of these algorithms heavily relies on the enumeration process of sub-structures. One common sub-structure is the cut, known for its efficient enumeration method and single-output characteristic. However, an increasing number of optimization algorithms now target sub-structures that incorporate multiple outputs. In this paper, we explore Maximum Fanout-Free Windows (MFFWs), a novel sub-structure with a multi-output nature, as well as its practical applications and enumeration algorithms. To accommodate various algorithm execution processes, we propose two different enumeration styles: Dynamic and Static. The Dynamic approach provides flexibility in adapting to changes in the AIG structure, whereas the Static method ensures efficiency as long as the AIG structure remains unchanged during execution. We apply these methods to rewriting and technology mapping to improve their runtime performance. Experimental results on pure enumeration and practical scenarios show the scalability and efficiency of the proposed MFFW enumeration methods.
14:10 CEST TS24.3 SIMGEN: SIMULATION PATTERN GENERATION FOR EFFICIENT EQUIVALENCE CHECKING
Speaker:
Carmine Rizzi, ETH Zurich, CH
Authors:
Carmine Rizzi1, Sarah Brunner1, Alan Mishchenko2 and Lana Josipovic1
1ETH Zurich, CH; 2University of California, Berkeley, US
Abstract
Combinational equivalence checking for hardware design tends to be slow due to the number and complexity of intermediate node equivalences considered by the SAT solver. This is because the solver often spends extensive time disproving nodes that appear equivalent under random simulation. We propose SimGen, an open-source and expressive simulation pattern generator inspired by Automatic Test Pattern Generation (ATPG); it exploits the circuit's structure and logic information to disprove the equivalence of circuit nodes and avoid excessive SAT calls. We demonstrate the effectiveness of SimGen's simulation patterns over those generated by state-of-the-art random and guided simulation.
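A minimal sketch of the baseline that SimGen improves on, namely plain random simulation used to disprove candidate node equivalences cheaply before the SAT solver is invoked: nodes whose bit-parallel signatures differ on some pattern cannot be equivalent. The tiny node functions below are assumptions standing in for real AIG nodes.

    import random
    random.seed(0)

    def signatures(n_inputs, n_patterns, node_fns):
        # node_fns: functions over a tuple of input bits -> node value (0/1);
        # each node accumulates one signature bit per simulated pattern
        sigs = [0] * len(node_fns)
        for _ in range(n_patterns):
            inputs = tuple(random.getrandbits(1) for _ in range(n_inputs))
            for k, fn in enumerate(node_fns):
                sigs[k] = (sigs[k] << 1) | fn(inputs)
        return sigs

    # three candidate nodes; the first two are truly equivalent, the third is not
    nodes = [lambda x: x[0] & x[1],
             lambda x: ~(~x[0] | ~x[1]) & 1,
             lambda x: x[0] | x[1]]
    sigs = signatures(2, 64, nodes)
    print("still candidate-equivalent:", sigs[0] == sigs[1],
          "disproved without SAT:", sigs[0] != sigs[2])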
14:15 CEST TS24.4 ELMAP: AREA-DRIVEN LUT MAPPING WITH K-LUT NETWORK EXACT SYNTHESIS
Speaker:
Hongyang Pan, Fudan University, CN
Authors:
Hongyang Pan1, Keren Zhu1, Fan Yang1, Zhufei Chu2 and Xuan Zeng1
1Fudan University, CN; 2Ningbo University, CN
Abstract
Mapping to k-input lookup tables (k-LUTs) is a critical process in field-programmable gate array (FPGA) synthesis. However, the structure of the subject graph can introduce structural bias, which refers to the dependency of mapping results on the inherent graph structure, often leading to suboptimal results. To address this, we present ELMap, an area-driven LUT mapping framework. It incorporates structural choice during the collapsing phase. This enables dynamic decomposition, maximizing local-to-global optimization transfer. To ensure seamless integration between the optimization and mapping processes, ELMap leverages exact k-LUT synthesis to generate area-optimal sub-LUT networks. Experiments on the EPFL benchmark suite demonstrate that ELMap significantly outperforms state-of-the-art methods. Specifically, in 6-LUT mapping, ELMap reduces the average LUT area by 8.5% and improves the area-depth product (ADP) by 5.8%. In 4-LUT remapping, it reduces the average LUT area by 17.6% and improves the ADP by 2.4%.
14:20 CEST TS24.5 APPLICATION OF FORMAL METHODS (SAT/SMT) TO THE DESIGN OF CONSTRAINED CODES
Speaker:
Sunil Sudhakaran, Student, US
Authors:
Sunil Sudhakaran1, Clark Barrett2 and Mark Horowitz2
1Student, US; 2Stanford University, US
Abstract
Constrained coding plays a crucial role in high-speed communication links by restricting bit sequences to reduce the adverse effects imposed by the characteristics of the channel. This technique trades off some bit efficiency for higher transmission rates, thereby boosting overall data throughput. We show how the design of hardware-efficient translation logic to and from the restricted code space can be formulated as a Satisfiability Modulo Theories (SMT) problem. Using SMT, we can not only try to minimize the complexity of this logic and limit the effect of transmission errors on the final decoded output, but also significantly reduce development time—from weeks to just hours. Our initial results demonstrate the efficiency and effectiveness of this approach.
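A minimal sketch of casting a constrained-code feasibility question as an SMT problem, using the Z3 Python API (the z3-solver package is an assumed dependency); the run-length-style constraint and the 3-bit-to-4-bit block size are toy stand-ins for the paper's channel constraints, and encoder logic cost is not modelled.

    from z3 import BitVec, Extract, Solver, Distinct, Not, And, sat

    code = [BitVec(f"c{i}", 4) for i in range(8)]    # one 4-bit codeword per 3-bit datum
    s = Solver()
    s.add(Distinct(*code))                            # decodability: codewords are unique
    for c in code:
        bits = [Extract(j, j, c) for j in range(4)]
        for j in range(2):                            # forbid runs of three identical bits
            s.add(Not(And(bits[j] == bits[j + 1], bits[j + 1] == bits[j + 2])))

    if s.check() == sat:
        m = s.model()
        print([m[c].as_long() for c in code])         # one feasible encoder table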
14:25 CEST TS24.6 WIDEGATE: BEYOND DIRECTED ACYCLIC GRAPH LEARNING IN SUBCIRCUIT BOUNDARY PREDICTION
Speaker:
Jiawei Liu, Beijing University of Posts and Telecommunications, CN
Authors:
Jiawei Liu1, Zhiyan Liu1, Xun He1, Jianwang Zhai1, Zhengyuan Shi2, Qiang Xu2, Bei Yu2 and Chuan Shi1
1Beijing University of Posts and Telecommunications, CN; 2The Chinese University of Hong Kong, HK
Abstract
Subcircuit boundary prediction is an important application of machine learning in logical analysis, effectively supporting tasks such as functional verification and logic optimization. Existing methods often convert circuits into and-inverter graphs and then use directed acyclic graph neural networks to perform this task. However, two key characteristics of subcircuit boundary prediction do not align with the fundamental assumptions of DAG learning, which limits the model's expressiveness and generalization capabilities. To break these assumptions, we propose WideGate, which includes a receptive field generation module that extends beyond the fanin cone and fanout cone, as well as an adaptive aggregation module that focuses on boundaries. Extensive experiments show that WideGate significantly outperforms existing methods in terms of prediction accuracy and training efficiency for subcircuit boundary prediction. The code is available at https://github.com/BUPT-GAMMA/WideGate.
14:30 CEST TS24.7 BIAS BY DESIGN: DIVERSITY QUANTIFICATION TO MITIGATE STRUCTURAL BIAS EFFECTS IN AIG LOGIC OPTIMIZATION
Speaker:
Isabella Venancia Gardner, Universiteit van Amsterdam, NL
Authors:
Isabella Venancia Gardner1, Marcel Walter2, Yukio Miyasaka3, Robert Wille2 and Michael Cochez4
1Universiteit van Amsterdam, NL; 2TU Munich, DE; 3University of California, Berkeley, US; 4Vrije Universiteit Amsterdam, NL
Abstract
And-Inverter Graphs (AIGs) are a fundamental data structure in logic optimization and are widely used in modern electronic design automation. A persistent challenge in AIG optimization is structural bias, where the initial graph structure significantly influences optimization quality by restricting the search space, often resulting in suboptimal outcomes. Existing methods address this issue by running multiple optimization workflows in parallel, relying on a trial-and-error approach that lacks a systematic way to measure structural diversity or assess effectiveness, making them computationally expensive and inefficient. This paper introduces a novel framework for systematically evaluating and reducing structural bias by measuring structural diversity, defined as the degree of dissimilarity between AIG graphs. Several traditional graph similarity measures and newly proposed AIG-specific metrics, including the Rewrite, Refactor, and Resub Scores, are explored. Results reveal limitations in traditional graph similarity metrics and highlight the effectiveness of the proposed AIG-specific measures in quantifying structural dissimilarity. Notably, the RRR Score shows a strong correlation (Pearson correlation coefficient, r = 0.79) with post-optimization structural differences, demonstrating the reliability of the metric in capturing meaningful variations between AIG structures. This work addresses the challenge of quantifying structural bias and offers a methodology that could potentially improve optimization outcomes, with future extensions applicable to other types of logic graphs.
14:35 CEST TS24.8 TIMING-DRIVEN APPROXIMATE LOGIC SYNTHESIS BASED ON DOUBLE-CHASE GREY WOLF OPTIMIZER
Speaker:
Xiangfei Hu, Southeast University, CN
Authors:
Xiangfei Hu1, Yuyang Ye2, Tinghuan Chen3, Hao Yan1 and Bei Yu2
1Southeast University, CN; 2The Chinese University of Hong Kong, HK; 3The Chinese University of Hong Kong, Shenzhen, CN
Abstract
With the shrinking technology nodes, timing optimization becomes increasingly challenging. Approximate logic synthesis (ALS) can perform local approximate changes (LACs) on circuits to optimize timing at the cost of slight inaccuracy. However, existing ALS methods that focus solely on critical path depth reduction or area minimization are not optimal in timing optimization. This paper proposes an effective timing-driven ALS framework, where we employ a double-chase grey wolf optimizer to explore and apply LACs, simultaneously bringing excellent critical path shortening and area reduction under error constraints. Subsequently, it utilizes post-optimization under area constraints to convert area reduction into further timing improvement, thus achieving maximum critical path delay reduction. According to experiments on open-source circuits with 28nm technology, compared to the SOTA method, our framework can generate approximate circuits with greater critical path delay reduction under different error and area constraints.
14:40 CEST TS24.9 IRW: AN INTELLIGENT REWRITING
Speaker:
Haisheng Zheng, Shanghai Artificial Intelligence Laboratory, CN
Authors:
Haisheng Zheng1, Haoyuan WU2, Zhuolun He2, Yuzhe Ma3 and Bei Yu2
1Shanghai AI Laboratory, CN; 2The Chinese University of Hong Kong, HK; 3The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
This paper proposes a novel machine learning-driven rewriting algorithm to optimize And-Inverter Graphs (AIGs) for refining combinational logic prior to technology mapping. The algorithm, called iRw, iteratively extracts subcircuits in AIGs and replaces them with more streamlined implementations. These subcircuits are identified using an original extraction algorithm, while the compact implementations are produced through rewriting techniques guided by a machine learning model. This approach efficiently enables the generation of logically equivalent subcircuits with minimal overhead. Experiments on benchmark circuits indicate that the proposed methodology outperforms state-of-the-art AIG rewriting techniques in both quality and runtime.
14:41 CEST TS24.10 AUTOMATIC ROUTING FOR PHOTONIC INTEGRATED CIRCUITS UNDER DELAY MATCHING CONSTRAINTS
Speaker:
Yuchao Wu, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Yuchao Wu1, Weilong Guan1, Yeyu Tong2 and Yuzhe Ma1
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Optical interconnects have emerged as a promising solution for rack-, board-scale, and even in-package communications, thanks to their high available optical bandwidth and minimal latency. However, the optical waveguides are intrinsically different from traditional metal wires, especially the phase matching constraints, which impose new challenges for routing in the photonic integrated circuits design. In this paper, we propose a comprehensive and efficient optical routing framework that introduces a diffuse-based length-matching method and bend modification methods to ensure phase-matching constraints. Furthermore, we present a congestion-based A* formulation with a negotiated congestion-based rip-up and reroute strategy on new rectangular grids with an aspect ratio of 1:√3 to reduce insertion loss. Experimental results based on real photonic integrated designs show that our optical routing flow can reduce total insertion loss by 11% and maximum insertion loss by 108%, while effectively satisfying matching constraints, compared to manual results.
14:42 CEST TS24.11 ML-BASED AIG TIMING PREDICTION TO ENHANCE LOGIC OPTIMIZATION
Speaker:
Sachin Sapatnekar, University of Minnesota, US
Authors:
Wenjing Jiang1, Jin Yan2 and Sachin S. Sapatnekar1
1University of Minnesota, US; 2Google, US
Abstract
Traditional logic optimization relies on proxy metrics to approximate post-mapping performance and area, which may not correlate well with post-mapping delay and area. This paper explores a ground-truth-based optimization flow that directly incorporates the post-mapping delay and area during optimization using decision tree-based machine learning models. Results show high prediction accuracy and generalization to unseen designs.
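A minimal sketch of the ground-truth modelling step, with synthetic features in place of real AIG statistics and scikit-learn as an assumed dependency: a decision-tree regressor learns to predict a post-mapping delay proxy that the optimization flow can query instead of a structural metric.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    # toy features: [node count, logic depth, mean fanout]; toy ground-truth delay
    X = rng.uniform([1e3, 10, 1.5], [1e5, 80, 4.0], size=(500, 3))
    y = 0.02 * X[:, 1] + 1e-5 * X[:, 0] + rng.normal(0, 0.05, 500)

    model = DecisionTreeRegressor(max_depth=6).fit(X[:400], y[:400])
    print("held-out MAE:", float(np.abs(model.predict(X[400:]) - y[400:]).mean()))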

TS25 Design and Test for Machine Learning and Machine Learning for Design and Test

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS25.1 HYATTEN: HYBRID PHOTONIC-DIGITAL ARCHITECTURE FOR ACCELERATING ATTENTION MECHANISM
Speaker:
Huize Li, National University of Singapore, SG
Authors:
Huize Li, Dan Chen and Tulika Mitra, National University of Singapore, SG
Abstract
The wide adoption and substantial computational resource requirements of attention-based Transformers have spurred the demand for efficient hardware accelerators. Unlike digital-based accelerators, there is growing interest in exploring photonics due to its high energy efficiency and ultra-fast processing speeds. However, the significant signal conversion overhead limits the performance of photonic-based accelerators. In this work, we propose HyAtten, a photonic-based attention accelerator with minimized signal conversion overhead. HyAtten incorporates a signal comparator to classify signals into two categories based on whether they can be processed by low-resolution converters. HyAtten integrates low-resolution converters to process all low-resolution signals, thereby boosting the parallelism of photonic computing. For signals requiring high-resolution conversion, HyAtten uses digital circuits instead of signal converters to reduce area and latency overhead. Compared to the state-of-the-art photonic-based Transformer accelerator, HyAtten achieves 9.8× performance/area and 2.2× energy-efficiency/area improvement.
14:05 CEST TS25.2 SEGA-DCIM: DESIGN SPACE EXPLORATION-GUIDED AUTOMATIC DIGITAL CIM COMPILER WITH MULTIPLE PRECISION SUPPORT
Speaker:
Haikang Diao, Peking University, CN
Authors:
Haikang Diao, Haoyi Zhang, Jiahao Song, Haoyang Luo, Yibo Lin, Runsheng Wang, Yuan Wang and Xiyuan Tang, Peking University, CN
Abstract
Digital computing-in-memory (DCIM) has been a popular solution for addressing the memory wall problem in recent years. However, the DCIM design still heavily relies on manual efforts, and the optimization of DCIM is often based on human experience. These disadvantages lengthen the time to market while increasing the design difficulty of DCIMs. This work proposes a design space exploration-guided automatic DCIM compiler (SEGA-DCIM) with multiple precision support, including integer and floating-point data precision operations. SEGA-DCIM can automatically generate netlists and layouts of DCIM designs by leveraging a template-based method. With a multi-objective genetic algorithm (MOGA)-based design space explorer, SEGA-DCIM can easily select appropriate DCIM designs for a specific application considering the trade-offs among area, power, and delay. As demonstrated by the experimental results, SEGA-DCIM offers solutions with wide design space, including integer and floating-point precision designs, while maintaining competitive performance compared to state-of-the-art (SOTA) DCIMs.
14:10 CEST TS25.3 SOFTMAP: SOFTWARE-HARDWARE CO-DESIGN FOR INTEGER-ONLY SOFTMAX ON ASSOCIATIVE PROCESSORS
Speaker:
Mariam Rakka, University of California, Irvine, US
Authors:
Mariam Rakka1, Jinhao Li2, Guohao Dai3, Ahmed Eltawil4, Mohammed Fouda5 and Fadi Kurdahi1
1University of California, Irvine, US; 2Shanghai Jiao Tong University, CN; 3Qingyuan Research Institute, Shanghai Jiao Tong University, CN; 4King Abdullah University of Science and Technology, SA; 5Rain AI, US
Abstract
Recent research efforts focus on reducing the computational and memory overheads of Large Language Models (LLMs) to make them feasible on resource-constrained devices. Despite advancements in compression techniques, non-linear operators like Softmax and Layernorm remain bottlenecks due to their sensitivity to quantization. We propose SoftmAP, a software-hardware co-design methodology that implements an integer-only low-precision Softmax using In-Memory Compute (IMC) hardware. Our method achieves up to three orders of magnitude improvement in the energy-delay product compared to A100 and RTX3090 GPUs, making LLMs more deployable without compromising performance.
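A minimal sketch of an integer-only softmax in the spirit of the abstract, not SoftmAP's kernel: exp() is replaced by a power-of-two approximation so only shifts, adds and one integer division remain. The fixed-point scale, the linear interpolation, and the assumption that logits stay in a small range are all illustrative choices.

    import numpy as np

    def int_softmax(q_logits, scale_log2=4):
        # q_logits: integer logits in fixed point with 2**scale_log2 steps per unit
        z = q_logits - q_logits.max()                 # <= 0, keeps the "exp" in (0, 1]
        shift = (-z) >> scale_log2                    # integer part of the exponent
        frac = (-z) & ((1 << scale_log2) - 1)         # fractional part
        # linear interpolation of 2**(-frac/scale) between 1.0 and 0.5, in Q8
        p = (256 - (frac << 7 >> scale_log2)) >> shift
        return (p * 256) // max(int(p.sum()), 1)      # normalized Q8 probabilities

    q = np.array([48, 32, 16, 0], dtype=np.int64)     # toy quantized logits
    print(int_softmax(q))                             # integer probabilities summing to ~256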
14:15 CEST TS25.4 COMPREHENSIVE RISC-V FLOATING-POINT VERIFICATION: EFFICIENT COVERAGE MODELS AND CONSTRAINT-BASED TEST GENERATION
Speaker:
Tianyao Lu, College of Information Science and Electronic Engineering, Zhejiang University, CN
Authors:
Tianyao Lu, Anlin Liu, Bingjie Xia and Peng Liu, Zhejiang University, CN
Abstract
The increasing complexity of processor architectures necessitates more rigorous functional verification. Floating-point operations, in particular, present significant challenges due to their extensive range of computational cases that require verification. This paper proposes a comprehensive approach for generating floating-point instruction sequences to enhance the verification of RISC-V. We introduce a constraint-based method for floating-point test generation and design efficient coverage models as input constraints for this process. The resulting representative floating-point tests are integrated with RISC-V instruction sequence generation through a memory-bound register update method. Experimental results demonstrate that our approach improves the functional coverage of RISC-V floating-point instruction sequences from 93.32% to 98.34%, while simultaneously reducing the number of required instructions by 66.67% compared to the Google RISCV-DV generator. Additionally, our method achieves more comprehensive coverage of floating-point types in instruction write-back data compared to RISCV-DV. Using the proposed approach, we successfully detect representative floating-point-related faults injected into the RISC-V processor CV32E40P, thereby demonstrating its effectiveness.
14:20 CEST TS25.5 WINACC: WINDOW-BASED ACCELERATION OF NEURAL NETWORKS USING BLOCK FLOATING POINT
Speaker:
Xin Ju, National University of Defense Technology, CN
Authors:
Xin Ju, Jun He, Mei Wen, Jing Feng, Yasong Cao, Junzhong Shen, Zhaoyun Chen and Yang Shi, National University of Defense Technology, CN
Abstract
Deep Neural Networks (DNNs) impose significant computational demands, necessitating optimizations for computational and energy efficiencies. Per-vector scaling, which applies a scaling factor to blocks of elements using narrow integer types, effectively reduces storage and computational overhead. However, the frequent occurrence of floating-point accumulations between vectors limits further improvements in energy efficiency. State-of-the-art accelerators address this challenge by grouping and summing vector products based on their exponent differences, thereby reducing the overhead associated with intra-group shifting and accumulation. Nevertheless, this approach increases the complexity of register usage and grouping logic, leading to limited energy benefits and hardware efficiency. In this context, we introduce WinAcc, a novel algorithm and architecture co-designed solution that utilizes a low-cost accumulator to handle the majority of data in DNNs, offering low area overhead and high energy efficiency gains. Our key insight is that the data of DNNs follows a Laplace-like distribution, which enables the use of a customized data format with a narrow dynamic range to encode most of the data. This allows for the design of a low-cost accumulator with narrow shifters and adders, significantly reducing reliance on floating-point accumulator and consequently improving energy efficiency. Compared with state-of-the-art architecture Bucket, WinAcc achieves 33.95% energy reduction across seven representative DNNs and reduces area by 9.5% while maintaining superior model performance.
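A minimal sketch of per-vector block floating point, the format family WinAcc builds on: a block of values shares one exponent and stores narrow integer mantissas, so a dot product reduces to integer MACs plus a single scale. The mantissa width and block size are assumptions.

    import numpy as np

    def to_bfp(vec, mant_bits=8):
        # one shared exponent per block, narrow integer mantissas per element
        shared_exp = int(np.ceil(np.log2(np.abs(vec).max() + 1e-12)))
        scale_exp = shared_exp - (mant_bits - 1)
        mant = np.clip(np.round(vec / 2.0 ** scale_exp),
                       -(1 << (mant_bits - 1)), (1 << (mant_bits - 1)) - 1).astype(np.int32)
        return mant, scale_exp

    def bfp_dot(m_a, e_a, m_b, e_b):
        # integer multiply-accumulate, then one scale by the shared exponents
        return int(np.dot(m_a.astype(np.int64), m_b.astype(np.int64))) * 2.0 ** (e_a + e_b)

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=16), rng.normal(size=16)
    (ma, ea), (mb, eb) = to_bfp(a), to_bfp(b)
    print("exact:", float(a @ b), " block floating point:", bfp_dot(ma, ea, mb, eb))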
14:25 CEST TS25.6 SACPLACE: MULTI-AGENT DEEP REINFORCEMENT LEARNING FOR SYMMETRY-AWARE ANALOG CIRCUIT PLACEMENT
Speaker:
Lei Cai, Wuhan University of Technology, CN
Authors:
Lei Cai1, Guojing Ge2, Guibo Zhu2, Jixin Zhang3, Jinqiao Wang2, Bowen Jia1 and Ning Xu1
1Wuhan University of Technology, CN; 2Institute of Automation, Chinese Academy of Sciences, CN; 3Hubei University of Technology, CN
Abstract
The placement of analog Integrated Circuits (ICs) plays a critical role in their physical design. The objective is to minimize the Half-Perimeter Wire Length (HPWL) while satisfying complex analog IC constraints, such as symmetry. Unlike digital ICs, analog ICs are highly sensitive to parasitic effects, making device symmetry crucial for optimal circuit performance. However, existing methods, including both machine learning-based and analytical approaches, struggle to meet strict symmetry constraints. In machine learning-based methods, training a general model is challenging due to the limited diversity of the training data. In analytical methods, the difficulty lies in formulating symmetry constraints as a convex function, which is necessary for gradient-based optimization of the placement. To address the issue, we formulate the placement process as a Markov decision process and propose SACPlace, a multi-agent deep reinforcement learning method for Symmetry-Aware analog Circuit Placement. SACPlace initially extracts layout information and various constraints as the input information for placement refinement and evaluation. Subsequently, SACPlace constructs multi-agent policy networks for symmetry-aware placement by refining placement guided by the evaluation of optimal symmetry quality. Following this, SACPlace constructs multi-layer perceptron-based critic networks to embed placement information for evaluating symmetry quality. This evaluation reward will be used for guiding placement refinement. Experimental results from four public analog IC datasets demonstrate that our method achieves the lowest HPWL while fully satisfying symmetry and common constraints, outperforming state-of-the-art methods. Additionally, simulation results on real-world analog ICs show better performance than these methods and even manual designs.
14:30 CEST TS25.7 LINEARIZATION OF QUADRATURE DIGITAL POWER AMPLIFIERS BY NEURAL NETWORK OF ULR_LSTM: UNSUPERVISED LEARNING RESIDUAL LSTM
Speaker:
Jiayu Yang, State Key Laboratory of Integrated Chips and Systems, School of Microelectronics, Fudan University, Shanghai, China, CN
Authors:
Jiayu Yang, Luyi Guo, Yicheng Li, Wang Wang, Zixu Li, Manni Li, Zijian Huang, Yinyin Lin, Yun Yin and Hongtao Xu, Fudan University, CN
Abstract
For the first time, this paper presents an unsupervised learning residual long short-term memory (ULR_LSTM) neural network to develop a digital predistortion (DPD) method for the linearization of digital power amplifiers (DPAs). Our method eliminates the need for iterative learning control (ILC) to obtain the ideal input of the DPA required by state-of-the-arts (SOTAs), which leads to high computational complexity and extensive training time. We perform behavioral modeling of the DPA using the R_LSTM network. After determining the optimal behavioral model architecture, the corresponding DPD model is obtained through an inverse training process. A 15-bit transformer-based quadrature DPA chip incorporating Class-G and IQ-cell-sharing techniques was implemented in a 28nm CMOS process to validate our proposed method. Experimental results demonstrate outstanding linearization performance compared to prior art, achieving an error vector magnitude (EVM) of -40.4dB for the 802.11ax 40MHz 64QAM signal.
14:35 CEST TS25.8 COMPATIBILITY GRAPH ASSISTED AUTOMATIC HARDWARE TROJAN INSERTION FRAMEWORK
Speaker:
Anjum Riaz, IIT Jammu, IN
Authors:
Gaurav Kumar, Ashfaq Shaik, Anjum Riaz, Yamuna Prasad and Satyadev Ahlawat, IIT Jammu, IN
Abstract
Hardware Trojans (HTs) pose substantial security threats to Integrated Circuits (ICs), compromising their integrity, confidentiality, and functionality. Various HT detection methods have been developed to mitigate these risks. However, the limited availability of comprehensive HT benchmarks necessitates designers to create their own for evaluation purposes. Moreover, the existing benchmarks exhibit several deficiencies, including a restricted range of trigger nodes, susceptibility to detection through random patterns, lengthy HT instance creation and validation process, and a limited number of HT instances per circuit. To address these limitations, we propose a Compatibility Graph assisted automatic Hardware Trojan insertion framework for HT benchmark generation. Given a netlist, this framework generates a design incorporating single or multiple HT instances according to user-defined properties. It allows various configurations of HTs, such as a large number of trigger nodes, low activation probability and large number of unique HT instances. The experimental results demonstrate that the generated HT benchmarks exhibit exceptional resistance to state-of-the-art HT detection schemes. Additionally, the proposed framework achieves an average improvement of 37815.7x and 989.4x over the insertion times of the Random and Reinforcement Learning based HT insertion frameworks, respectively.
14:40 CEST TS25.9 TOWARDS ROBUST RRAM-BASED VISION TRANSFORMER MODELS WITH NOISE-AWARE KNOWLEDGE DISTILLATION
Speaker:
Wenyong Zhou, The University of Hong Kong, HK
Authors:
Wenyong Zhou, Taiqiang Wu, Chenchen Ding, Yuan Ren, Zhengwu Liu and Ngai Wong, The University of Hong Kong, HK
Abstract
Resistive random-access memory (RRAM)-based compute-in-memory (CIM) systems show promise in accelerating Transformer-based vision models but face challenges from inherent device non-idealities. In this work, we systematically investigate the vulnerability of Transformer-based vision models to RRAM-induced perturbations. Our analysis reveals that earlier Transformer layers are more vulnerable than later ones, and feed-forward networks (FFNs) are more susceptible to noise than multi-head self-attention (MHSA). Based on these observations, we propose a noise-aware knowledge distillation framework that enhances model robustness by aligning both intermediate features and final outputs between weight-perturbed and noise-free models. Experimental results demonstrate that our method improves accuracy by up to 1.54% and 1.49% on ViT and DeiT models under various noise conditions compared to their vanilla counterparts.
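A minimal PyTorch sketch of the distillation objective described in the abstract, with a toy two-layer block standing in for a ViT/DeiT model and multiplicative Gaussian noise standing in for RRAM non-idealities (both assumptions): the loss aligns an intermediate feature and the output distribution of the weight-perturbed copy with its noise-free teacher.

    import torch
    import torch.nn as nn

    class TinyBlock(nn.Module):
        def __init__(self, d=32):
            super().__init__()
            self.ffn = nn.Sequential(nn.Linear(d, 2 * d), nn.GELU(), nn.Linear(2 * d, d))
            self.head = nn.Linear(d, 10)
        def forward(self, x):
            feat = self.ffn(x)                  # intermediate feature to be aligned
            return feat, self.head(feat)

    teacher, student = TinyBlock(), TinyBlock()
    student.load_state_dict(teacher.state_dict())
    with torch.no_grad():                       # simulate conductance (weight) noise
        for p in student.parameters():
            p.add_(0.05 * p.abs() * torch.randn_like(p))

    x = torch.randn(8, 32)
    with torch.no_grad():
        t_feat, t_out = teacher(x)
    s_feat, s_out = student(x)
    loss = nn.functional.mse_loss(s_feat, t_feat) + \
           nn.functional.kl_div(s_out.log_softmax(-1), t_out.softmax(-1), reduction="batchmean")
    loss.backward()                             # gradients drive noise-aware fine-tuning
    print(float(loss))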
14:41 CEST TS25.10 HYIMC: ANALOG-DIGITAL HYBRID IN-MEMORY COMPUTING SOC FOR HIGH-QUALITY LOW-LATENCY SPEECH ENHANCEMENT
Speaker:
Wanru Mao, Beihang University, CN
Authors:
Wanru Mao1, Hanjie Liu1, Guangyao Wang1, Tianshuo Bai1, Jingcheng Gu1, Han Zhang1, Xitong Yang2, Aifei Zhang2, Xiaohang Wei2, Meng Wang2 and Wang Kang1
1Beihang University, CN; 2Zhicun Research Lab, CN
Abstract
In-memory computing (IMC) holds significant promise for accelerating deep learning-based speech enhancement (DL-SE). However, existing IMC architectures face challenges in simultaneously achieving high precision, energy efficiency, and the necessary parallelism for DL-SE's inherent temporal dependencies. This paper introduces HyIMC, a novel hybrid analog-digital IMC architecture designed to address these limitations. HyIMC features: 1) a hybrid analog-digital design optimized for DL-SE algorithms; 2) a schedule controller that efficiently manages recurrent dataflow within skip connections; and 3) non-key dimension shrinkage, a model compression technique that preserves accuracy. Implemented on a 40nm eFlash-based IMC SoC prototype, HyIMC achieves 160 TOPS/W energy efficiency, compresses the DL-SE model size by ∼600%, improves the figure of merit by ∼1200%, and enhances perceptual evaluation of speech quality by ∼120%.

TS26 Design and test for analog and mixed-signal circuits / systems / MEMS

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS26.1 INTO-OA: INTERPRETABLE TOPOLOGY OPTIMIZATION FOR OPERATIONAL AMPLIFIERS
Speaker:
Jinyi Shen, Fudan University, CN
Authors:
Jinyi Shen, Fan Yang, Li Shang, Zhaori Bi, Changhao Yan, Dian Zhou and Xuan Zeng, Fudan University, CN
Abstract
This paper presents INTO-OA, an interpretable topology optimization method for operational amplifiers (op-amps). We propose a Bayesian optimization-based approach to effectively explore the high-dimensional, discrete topology design space of op-amps. Our method integrates a Gaussian process surrogate model with the Weisfeiler-Lehman graph kernel to extract structural features from a dedicated circuit graph representation. It also employs a candidate generation strategy that combines random sampling with mutation to balance global exploration and local exploitation. Additionally, INTO-OA enhances interpretability by assessing the impact of circuit structures on performance, providing designers with valuable insights into generated topologies and enabling the interpretable refinement of existing designs. Experimental results demonstrate that INTO-OA achieves higher success rates, a 1.84× to 19.10× improvement in op-amp performance, and a 3.20× to 14.33× increase in topology optimization efficiency compared to state-of-the-art methods.
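To make the Weisfeiler-Lehman ingredient concrete, here is a minimal, library-free sketch of WL relabeling on a small labeled device graph and a histogram-dot-product similarity between two graphs. The node labels ('nmos', 'pmos', 'cap') and the two tiny graphs are hypothetical; in INTO-OA the WL kernel is used inside a Gaussian process surrogate rather than stand-alone.

```python
from collections import Counter

def wl_features(adjacency, labels, iterations=2):
    """
    Weisfeiler-Lehman subtree features of one labeled graph.
    adjacency: dict node -> list of neighbour nodes; labels: dict node -> initial label.
    Returns a Counter over (iteration, label) pairs.
    """
    feats = Counter((0, lbl) for lbl in labels.values())
    current = dict(labels)
    for it in range(1, iterations + 1):
        relabeled = {}
        for node, nbrs in adjacency.items():
            # New label = own label plus the sorted multiset of neighbour labels.
            relabeled[node] = (current[node], tuple(sorted(str(current[n]) for n in nbrs)))
            feats[(it, relabeled[node])] += 1
        current = relabeled
    return feats

def wl_similarity(f1, f2):
    """Dot product of two WL feature histograms (an unnormalized graph-kernel value)."""
    return sum(count * f2.get(key, 0) for key, count in f1.items())

# Two tiny, hypothetical device graphs standing in for op-amp topology fragments.
adj_a = {"m1": ["m2"], "m2": ["m1", "c1"], "c1": ["m2"]}
lab_a = {"m1": "nmos", "m2": "pmos", "c1": "cap"}
adj_b = {"m1": ["m2"], "m2": ["m1"]}
lab_b = {"m1": "nmos", "m2": "pmos"}
print(wl_similarity(wl_features(adj_a, lab_a), wl_features(adj_b, lab_b)))
```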
14:05 CEST TS26.2 EFFECTIVE ANALOG ICS FLOORPLANNING WITH RELATIONAL GRAPH NEURAL NETWORKS AND REINFORCEMENT LEARNING
Speaker:
Davide Basso, University of Trieste, IT
Authors:
Davide Basso1, Luca Bortolussi1, Mirjana Videnovic-Misic2 and Husni Habal3
1University of Trieste, IT; 2Infineon Technologies AT, AT; 3Infineon Technologies, DE
Abstract
Analog integrated circuit (IC) floorplanning is typically a manual process with the placement of components (devices and modules) planned by a layout engineer. This process is further complicated by the interdependence of floorplanning and routing steps, numerous electric and layout-dependent constraints, as well as the high level of customization expected in analog design. This paper presents a novel automatic floorplanning algorithm based on reinforcement learning. It is augmented by a relational graph convolutional neural network model for encoding circuit features and positional constraints. The combination of these two machine learning methods enables knowledge transfer across different circuit designs with distinct topologies and constraints, increasing the generalization ability of the solution. Applied to 6 industrial circuits, our approach surpassed established floorplanning techniques in terms of speed, area and half-perimeter wire length. When integrated into a procedural generator for layout completion, overall layout time was reduced by 67.3% with an 8.3% mean area reduction compared to manual layout.
14:10 CEST TS26.3 FORMALLY VERIFYING ANALOG NEURAL NETWORKS WITH DEVICE MISMATCH VARIATIONS
Speaker:
Tobias Ladner, TU Munich, DE
Authors:
Yasmine Abu-Haeyeh1, Thomas Bartelsmeier2, Tobias Ladner3, Matthias Althoff3, Lars Hedrich4 and Markus Olbrich2
1University of Frankfurt, DE; 2Leibniz University Hannover, DE; 3TU Munich, DE; 4Goethe University Frankfurt, DE
Abstract
Training and running inference of large neural networks come with excessive cost and power consumption. Thus, realizing these networks as analog circuits is an energy- and area-efficient alternative. However, analog neural networks suffer from inherent deviations within their circuits, requiring extensive testing for their correct behavior under these deviations. Unfortunately, tests based on Monte Carlo simulations are extremely time- and resource-intensive. We present an alternative approach that proves the correctness of the neural network using formal neural network verification techniques, together with a modeling methodology for these analog neural circuits. Our experimental results compare two methods based on reachability analysis, showing their effectiveness by reducing the test time from days to milliseconds. Thus, they offer a faster, more scalable solution for verifying the correctness of analog neural circuits.
14:15 CEST TS26.4 POST-LAYOUT AUTOMATED OPTIMIZATION FOR CAPACITOR ARRAY IN DIGITAL-TO-TIME CONVERTER
Speaker:
Hefei Wang, Southern University of Science and Technology, CN
Authors:
Hefei Wang1, Jianghao Su1, Junhe Xue1, Haoran Lv1, Junhua Zhang2, Longyang Lin1, Kai Chen1, Lijuan Yang2 and Shenghua Zhou3
1Southern University of Science and Technology, CN; 2International Quantum Academy, CN; 3Southern University of Science and Technology; International Quantum Academy, CN
Abstract
The integral non-linearity (INL) of the Digital-to-Time Converter (DTC) in fractional-N phase-locked loops introduces fractional spurs, especially at near-integer channels, resulting in increased jitter. To meet the strict jitter and spur performance requirements of high-performance wireless transceivers, minimizing the INL in DTC designs is crucial. This work presents a computer-aided, automated optimization methodology that focuses on addressing issues stemming from the uniform capacitor unit structure within the capacitor array in a Variable-Slope DTC. These issues include parasitic resistance and capacitance, which distort the charging and discharging behavior of the capacitors, contributing to INL. By systematically optimizing the capacitor layout and mitigating parasitic effects, the methodology allows precise tuning of each capacitor unit in the capacitor array to reduce INL, enhancing the overall performance of the DTC.
14:20 CEST TS26.5 TIME-DOMAIN 3D ELECTROMAGNETIC FIELDS ESTIMATION BASED ON PHYSICS-INFORMED DEEP LEARNING FRAMEWORK
Speaker:
Huifan Zhang, ShanghaiTech University, CN
Authors:
Huifan Zhang, Yun Hu and Pingqiang Zhou, ShanghaiTech University, CN
Abstract
Electromagnetic simulation is important and time-consuming in RF/microwave circuit design. Physics-informed deep learning is a promising method to learn a family of parametric partial differential equations. In this work, we propose a physics-informed deep learning framework to estimate time-domain 3D electromagnetic fields. Our method leverages physics-informed loss functions to model Maxwell's equations which govern electromagnetic fields. Our post-trained model produces accurate results with over 200x speedup over the FDTD simulation. We reduce the mean square error by at least 14% and 15%, with respect to purely data-driven learning and the Fourier operator learning method FNO. In order to optimize data and physical loss simultaneously, we introduce a self-adaptive scaling factors updating algorithm, which has 8.4% less error than the loss balancing method ReLoBRaLo.
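A physics-informed loss of this kind can be illustrated on the normalized 1D form of Maxwell's curl equations: the sketch below evaluates the PDE residual of predicted E and H fields with finite differences and adds it to a data-fitting term. The grid, normalization, and weighting factor lam are illustrative assumptions and are unrelated to the paper's 3D formulation or its adaptive scaling algorithm.

```python
import numpy as np

def maxwell_residual_1d(E, H, dx, dt):
    """
    Physics residuals of the normalized 1D Maxwell curl equations
        dE/dt + dH/dx = 0,   dH/dt + dE/dx = 0
    evaluated with central finite differences on predicted fields.
    E, H: arrays of shape (T, X) holding fields over time and space.
    Returns the mean squared residual, i.e. a 'physics loss' term.
    """
    dE_dt = (E[2:, 1:-1] - E[:-2, 1:-1]) / (2 * dt)
    dH_dx = (H[1:-1, 2:] - H[1:-1, :-2]) / (2 * dx)
    dH_dt = (H[2:, 1:-1] - H[:-2, 1:-1]) / (2 * dt)
    dE_dx = (E[1:-1, 2:] - E[1:-1, :-2]) / (2 * dx)
    r1 = dE_dt + dH_dx
    r2 = dH_dt + dE_dx
    return float(np.mean(r1 ** 2) + np.mean(r2 ** 2))

def total_loss(pred_E, pred_H, data_E, data_H, dx, dt, lam=1.0):
    """Weighted sum of a data term and the physics term; lam is a hand-set scale factor."""
    data_term = float(np.mean((pred_E - data_E) ** 2) + np.mean((pred_H - data_H) ** 2))
    return data_term + lam * maxwell_residual_1d(pred_E, pred_H, dx, dt)

# Toy usage on random "predictions" and "reference" fields of shape (time, space).
rng = np.random.default_rng(0)
E = rng.standard_normal((32, 64)); H = rng.standard_normal((32, 64))
print(total_loss(E, H, E, H, dx=1e-3, dt=1e-3))
```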
14:25 CEST TS26.6 TPC-GAN: BATCH TOPOLOGY SYNTHESIS FOR PERFORMANCE-COMPLIANT OPERATIONAL AMPLIFIERS USING GENERATIVE ADVERSARIAL NETWORKS
Speaker:
Jinglin Han, Beihang University, CN
Authors:
Yuhao Leng1, Jinglin Han1, Yining Wang2 and Peng Wang1
1Beihang University, CN; 2Corelink Technology (Qingdao) Co., Ltd., CN
Abstract
The operational amplifier is one of the most important analog building blocks. Existing automated synthesis strategies for operational amplifiers focus solely on the optimization of a single topology, making them unsuitable for scenarios requiring batch synthesis, such as dataset augmentation. In this paper, we introduce TPC-GAN, a generative model for batch topology synthesis of operational amplifiers in accordance with performance specifications. Specifically, it incorporates a reward network for circuit performance into the generative adversarial network (GAN). This enables direct synthesis of novel and feasible circuit topologies meeting performance specifications. Experimental results demonstrate that our proposed method can achieve a validity rate of 98% in circuit generation, among which 99.7% are novel relative to the training dataset. With the introduction of a reward network, a significant portion (82.8%) of the generated circuits satisfy performance specifications, a substantial improvement over those without. Transistor-level experimental results further demonstrate the practicality and competitiveness of our generated circuits, with a nearly 3x improvement over manual designs.
14:30 CEST TS26.7 NANOELECTROMECHANICAL BINARY COMPARATOR FOR EDGE-COMPUTING APPLICATIONS
Speaker:
Victor Marot, University of Bristol, GB
Authors:
Victor Marot, Manu Krishnan, Mukesh Kulsreshath, Elliott Worsey, Roshan Weerasekera and Dinesh Pamunuwa, University of Bristol, GB
Abstract
Bitwise comparison is a fundamental operation in many digital arithmetic functions and is ubiquitous in both datapath and control elements; for example, many machine learning algorithms depend on binary comparison. This work proposes a new class of binary comparator circuits using 4-terminal nanoelectromechanical (NEM) relays that require just 6 devices compared to 9 transistors in CMOS implementations. Moreover, NEM implementations are capable of withstanding much higher temperatures, up to 300°C, and radiation levels, well over 1 Mrad absorbed dose, conditions which are common across many industrial edge applications, with near zero standby power. 1-bit magnitude and equality comparators, each comprising two in-plane silicon 4-terminal relays, were fabricated on a silicon-on-insulator substrate and electrically characterized as a proof of concept, the first such demonstration. Using the 1-bit comparators as building blocks, a scalable tree-based topology is proposed to implement higher-order comparators, resulting in a ≈47% reduction in device count over a CMOS implementation for a 64-bit comparator. Circuit-level simulations of the comparator using accurate device models show that a single operation consumes at most 21 fJ, a 9-fold reduction over the best CMOS offering in an equivalent process node.
14:35 CEST TS26.8 CLOCK AND POWER SUPPLY-AWARE HIGH ACCURACY PHASE INTERPOLATOR LAYOUT SYNTHESIS
Speaker:
Hung-Ming Chen, National Yang Ming Chiao Tung University, TW
Authors:
Siou-Sian Lin1, Shih-Yu Chen1, Yu-Ping Huang1, Tzu-Chuan Lin1, Hung-Ming Chen2 and Wei-Zen Chen1
1NYCU, TW; 2National Yang Ming Chiao Tung University, TW
Abstract
Motivated by requests from designers of clock and data recovery (CDR) circuits, who face inefficiency in generating high-accuracy phase interpolators (PIs), we have developed a layout generator for such circuits in this work, different from conventional constraint-driven works. In the first stage, we propose customized template floorplanning plus pin generation as demanded by the users. In the second stage, in order to generate a high-accuracy layout, we implement a gridless router for signal, power supply and clock. Experiments with several configurations indicate that our approach can generate high-quality layouts that align with user expectations, and even surpass the quality of manual designs on structurally regular high-performance PIs, which are not easy or efficient to generate with prior primitive/grid-based methods.
14:40 CEST TS26.9 ML-BASED FAST AND ACCURATE PERFORMANCE MODELING AND PREDICTION FOR HIGH-SPEED MEMORY INTERFACES ACROSS DIFFERENT TECHNOLOGIES
Speaker:
Taehoon Kim, Seoul National University, KR
Authors:
Taehoon Kim1, Minjeong Kim1, Hankyu Chi2, Byungjun Kang2, Eunji Song2 and Woo-Seok Choi1
1Seoul National University, KR; 2SK hynix, KR
Abstract
The chip industry is undergoing a market transition from mass production to mass customization. Rapid market changes require agile responses and diversified product designs, particularly in interface circuits managing chip-to-chip communication. To facilitate these shifts, this paper proposes a machine learning-based method for rapidly and accurately predicting and analyzing the performance of high-speed transceivers, along with an evaluation methodology utilizing the proposed approach. Notably, by using process technology information as an input in the dataset, this is the first work to predict the performance of a design across different technologies, which will be invaluable in architecting and optimizing designs during the early stages of development. By simulating each functional block, we gather a dataset for parameterized design and performance and incorporate device characteristics from lookup tables. The transmitter, which operates like digital circuits, is trained using parameterized signals with a DNN, while the receiver, containing analog blocks and feedback structures, employs hybrid LSTM-DNN learning with time-series input and output. Our model, trained with a 40nm design, demonstrates high accuracy in predicting performance even with different foundries and technologies. The majority of performance parameters show an R^2 value exceeding 0.9, indicating strong predictive accuracy under varying conditions. This method provides valuable insights for early-stage design optimization and process technology scaling, offering potential for broader applications in other circuit design areas.
14:45 CEST TS26.10 ACCELERATING OTA CIRCUIT DESIGN: TRANSISTOR SIZING BASED ON A TRANSFORMER MODEL AND PRECOMPUTED LOOKUP TABLES
Speaker:
Subhadip Ghosh, University of Minnesota, US
Authors:
Subhadip Ghosh1, Endalk Gebru1, Chandramouli Kashyap2, Ramesh Harjani1 and Sachin S. Sapatnekar1
1University of Minnesota, US; 2Cadence Design Systems, US
Abstract
Device sizing is crucial for meeting performance specifications in operational transconductance amplifiers (OTAs), and this work proposes an automated sizing framework based on a transformer model. The approach first leverages the driving-point signal flow graph (DP-SFG) to map an OTA circuit and its specifications into transformer-friendly sequential data. A specialized tokenization approach is applied to the sequential data to expedite the training of the transformer on a diverse range of OTA topologies, under multiple specifications. Under specific performance constraints, the trained transformer model is used to accurately predict DP-SFG parameters in the inference phase. The predicted DP-SFG parameters are then translated to transistor sizes using a precomputed look-up table-based approach inspired by the gm/Id methodology. In contrast to previous conventional or machine-learning-based methods, the proposed framework achieves significant improvements in both speed and computational efficiency by reducing the need for expensive SPICE simulations within the optimization loop; instead, almost all SPICE simulations are confined to the one-time training phase. The method is validated on a variety of unseen specifications, and the sizing solution demonstrates over 90% success in meeting specifications with just one SPICE simulation for validation, and 100% success with 3-5 additional SPICE simulations.
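The final lookup-table step can be pictured with a classic gm/Id calculation: pick an inversion level, derive the bias current from the target gm, and read the current density off a precharacterized curve. The table values and the size_from_gm helper below are invented for illustration; real tables come from SPICE characterization across lengths and bias voltages, and this is not the paper's sizing flow.

```python
import numpy as np

# Hypothetical precharacterized table for one device type at fixed L and VDS:
# gm/Id [1/V] versus current density Id/W [A/um]. Numbers are purely illustrative.
GM_ID_GRID = np.array([ 5.0, 10.0, 15.0, 20.0, 25.0])
ID_PER_W   = np.array([8e-6, 3e-6, 1e-6, 3e-7, 8e-8])

def size_from_gm(gm_target, gm_over_id):
    """
    gm/Id sizing step: choose an inversion level (gm/Id), obtain the bias current from
    Id = gm / (gm/Id), look up the current density at that inversion level, and derive
    the width W = Id / (Id/W).
    """
    id_bias = gm_target / gm_over_id                       # required drain current [A]
    density = np.interp(gm_over_id, GM_ID_GRID, ID_PER_W)  # Id/W at this gm/Id [A/um]
    width = id_bias / density                              # [um]
    return width, id_bias

W, Id = size_from_gm(gm_target=2e-3, gm_over_id=15.0)      # 2 mS at gm/Id = 15 1/V
print(f"W = {W:.1f} um, Id = {Id * 1e3:.2f} mA")
```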
14:50 CEST TS26.11 A 10PS-ORDER FLEXIBLE RESOLUTION TIME-TO-DIGITAL CONVERTER WITH LINEARITY CALIBRATION AND LEGACY FPGA
Speaker:
Kentaroh Katoh, Fukuoka University, JP
Authors:
Kentaroh Katoh1, Toru Nakura2 and Haruo Kobayashi3
1Fukuoka University, JP; 2Fukuoka University, JP; 3Gunma University, JP
Abstract
This paper presents a 10 ps-order, flexible-resolution time-to-digital converter (TDC) consisting only of lookup tables and flip-flops, so it can be applied to legacy FPGAs, which makes it industry friendly. The proposed TDC is a Vernier delay-line based TDC. By using MUX chains as delay-adjustable buffers, it realizes a flexible, high-resolution, 10 ps-order TDC. By controlling the control values of each MUX chain independently, the nonlinearity of the TDC is compensated. In the evaluation using the AMD Artix-7 FPGA, the DNL and INL were [-0.26 LSB, 0.91 LSB] and [-0.84 LSB, 2.27 LSB], respectively, at a resolution of 8.92 ps.
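A behavioral view of a Vernier delay-line TDC helps explain where such a resolution figure comes from: the resolution is the difference between the per-stage delays of the slow (start) and fast (stop) chains. The delay values and stage count below are illustrative; only the resulting 8.92 ps resolution is taken from the abstract, and this is not a model of the FPGA implementation.

```python
def vernier_tdc_code(time_interval, t_slow=40e-12, t_fast=31.08e-12, stages=64):
    """
    Behavioral model of a Vernier delay-line TDC. The start edge travels through
    buffers of delay t_slow, the stop edge through buffers of delay t_fast; the stop
    edge gains (t_slow - t_fast) per stage, so the stage index at which it catches
    up encodes the input interval with resolution (t_slow - t_fast).
    """
    resolution = t_slow - t_fast
    for stage in range(stages):
        if stage * resolution >= time_interval:
            return stage
    return stages  # saturated: the interval exceeds the measurable range

# With these assumed delays the resolution is 40 ps - 31.08 ps = 8.92 ps.
print(vernier_tdc_code(50e-12))  # ~ceil(50 ps / 8.92 ps) = 6
```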

TS27 Design for On-Chip Interconnects

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST

Time Label Presentation Title
Authors
14:00 CEST TS27.1 HIPERNOC: A HIGH-PERFORMANCE NETWORK-ON-CHIP FOR FLEXIBLE AND SCALABLE FPGA-BASED SMARTNICS
Speaker:
Klajd Zyla, TU Munich, DE
Authors:
Klajd Zyla, Marco Liess, Thomas Wild and Andreas Herkersdorf, TU Munich, DE
Abstract
A recent approach that the research community has proposed to address the steep growth of network traffic and the attendant rise in computing demands is in-network computing. This paradigm shift is bringing about an increase in the types of computations performed by network devices. Consequently, processing demands are becoming more varied, requiring flexible packet-processing architectures. State-of-the-art switch-based smart network interface cards (SmartNICs) provide high versatility without sacrificing performance but do not scale well concerning resource usage. In this paper, we introduce HiPerNoC—a flexible and scalable field-programmable gate array (FPGA)-based SmartNIC architecture deploying a 2D-mesh network-on-chip (NoC) with a novel router design to manage network traffic with diverse processing demands. The NoC can forward incoming network packets to the available processing engines in the required sequence at a traffic load of up to 91.1 Gbit/s (0.89 flit/node/cycle). Each router applies distributed switch allocation and avoids head-of-line blocking by deploying queues at the switch crosspoints of input-output connections used by the routing algorithm. It also prevents deadlocks by employing non-blocking virtual cut-through switching. We implemented a prototype of HiPerNoC as a 4x4 2D-mesh NoC in SystemVerilog and evaluated it with synthetic network traffic via cycle-accurate register-transfer level simulations in Vivado. The evaluation results show that HiPerNoC achieves up to 53% higher saturation throughput, occupies 53 % fewer lookup tables and block RAMs, and consumes 16 % less power on an Alveo U55C than ProNoC—a state-of-the-art FPGA-based NoC.
14:05 CEST TS27.2 NEUROHEXA: A 2D/3D-SCALABLE MODEL-ADAPTIVE NOC ARCHITECTURE FOR NEUROMORPHIC COMPUTING
Speaker:
Yi Zhong, Peking University, CN
Authors:
Yi Zhong, Zilin Wang, Yipeng Gao, Xiaoxin Cui, Xing Zhang and Yuan Wang, Peking University, CN
Abstract
Neuromorphic computing has emerged as a novel computing paradigm that entails a bio-inspired architecture to reproduce the remarkable functionalities of the human brain, such as massively parallel processing and extremely low power consumption. However, those promising merits can be largely negated by a mismatched communication infrastructure in large-scale hardware implementations, in view of the vast degree of neural connectivity, the unstructured spike dataflow, and the unbalanced model workload assignment. In an effort to tackle those challenges, this work presents NeuroHexa, a network-on-chip (NoC) architecture intended for multi-core neuromorphic design. NeuroHexa adopts a customized intra-chip hexagonal topology, which can be further cascaded in 6 directions by either 2D or 3D chiplet integration. Designed with a globally asynchronous, locally synchronous (GALS) methodology, groups of processing nodes can operate at independent paces to further improve resource utilization. To satisfy the varied requirements of data reuse across the chip, NeuroHexa proposes a flexible multicast routing mechanism that adapts to the model-defined dataflow, and under congestion it can switch its routing algorithm between deterministic and fully adaptive routing modes. The presented NoC router is evaluated in 28nm CMOS, achieving a maximum throughput of 179.2 Gbps and a best energy efficiency of 4.872 pJ/packet with an area overhead of 0.0226 mm².
14:10 CEST TS27.3 SPB: TOWARDS LOW-LATENCY CXL MEMORY VIA SPECULATIVE PROTOCOL BYPASSING
Speaker:
Junbum Park, Sungkyunkwan University, KR
Authors:
Junbum Park, Yongho Lee, Sungbin Jang, Wonyoung Lee and Seokin Hong, Sungkyunkwan University, KR
Abstract
Compute Express Link (CXL) is an advanced interconnect standard designed to facilitate high-speed communication between CPUs, accelerators, and memory devices, making it well-suited for data-intensive applications such as machine learning and real-time analytics. Despite its advantages, CXL memory encounters significant latency challenges due to the complex hierarchy of protocol layers, which can adversely impact performance in latency-sensitive scenarios. To address this issue, we introduce the Speculative Protocol Bypassing (SPB) architecture, which aims to minimize latency during read operations by speculatively bypassing several protocol layers of CXL. To achieve this, SPB employs the Snooper mechanism, which extracts essential read commands from the Flit data at an early stage, allowing it to bypass multiple protocol layers and reduce memory access time. Additionally, the Hazard Filter (HF) prevents Read-After-Write (RAW) hazards between read and write operations, thereby maintaining data integrity and ensuring system reliability. The SPB architecture effectively optimizes CXL memory access latency, providing a robust solution for high-performance computing environments that require both low latency and high efficiency. Its minimal hardware overhead makes it a practical and scalable enhancement for future CXL-based memory.
14:15 CEST TS27.4 SRING: A SUB-RING CONSTRUCTION METHOD FOR APPLICATION-SPECIFIC WAVELENGTH-ROUTED OPTICAL NOCS
Speaker:
Zhidan Zheng, TU Munich, DE
Authors:
Zhidan Zheng, Meng Lian, Mengchu Li, Tsun-Ming Tseng and Ulf Schlichtmann, TU Munich, DE
Abstract
Wavelength-routed optical networks-on-chip (WRONoCs) attract ever-increasing attention for supporting high-speed communications with low power and latency. Among all WRONoC routers, optical ring routers attract much interest for their simple structures. However, current designs of ring routers have overlooked the customization problem: when adapting to applications that have specific communication requirements, current designs suffer high propagation loss caused by long worst-case signal paths and high splitter usage in power distribution networks (PDN). To address those problems, we propose a novel customization method to generate application-specific ring routers with multiple sub-rings, SRing. Instead of sequentially connecting all nodes in a large ring, we cluster the nodes and connect them with sub-ring waveguides to reduce the path length. Besides, we propose a mixed integer linear programming model for wavelength assignment to reduce the number of PDN splitters. We compare SRing to three state-of-the-art ring router design methods for six applications. Experimental results show that SRing can greatly reduce the length of the longest signal path, the worst-case insertion loss, and the number of splitters in the PDN, significantly improving the power efficiency.
14:20 CEST TS27.5 BEAM: A MULTI-CHANNEL OPTICAL INTERCONNECT FOR MULTI-GPU SYSTEMS
Speaker:
Chongyi Yang, Microelectronics Thrust, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Chongyi Yang1, Bohan Hu1, Peiyu Chen1, Yinyi Liu2, Wei Zhang2 and Jiang Xu1
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2The Hong Kong University of Science and Technology, HK
Abstract
High-performance computing and AI applications necessitate high-bandwidth communication between GPUs. Traditional electrical interconnects for GPU-to-GPU communication face challenges over longer distances, including high power consumption, crosstalk noise, and signal loss. In contrast, optical interconnects excel in this domain, offering high bandwidth and consistent power dissipation over long distances. This paper proposes BEAM, a Bandwidth-Enhanced optical interconnect Architecture for Multi-GPU systems. BEAM extends electrical-optical interfaces into the GPU package, positioning them close to GPU compute logic and memory. Unlike existing single-channel approaches, each BEAM optical interface incorporates multiple parallel optical channels, further enhancing bandwidth. An arbitration scheme manages channel usage among data transfers. Evaluation on Rodinia benchmarks and LLM training kernels demonstrates that BEAM achieves a speedup of 1.14 - 1.9× and reduces energy consumption by 29 - 44% compared to the electrical-interconnect system and state-of-the-art schemes, while maintaining comparable chip area consumption.
14:25 CEST TS27.6 TCDM BURST ACCESS: BREAKING THE BANDWIDTH BARRIER IN SHARED-L1 RVV CLUSTERS BEYOND 1000 FPUS
Speaker:
Diyou Shen, ETH Zurich, CH
Authors:
Diyou Shen1, Yichao Zhang1, Marco Bertuletti1 and Luca Benini2
1ETH Zurich, CH; 2ETH Zurich, CH | Università di Bologna, IT
Abstract
As computing demand and memory footprint of deep learning applications accelerate, clusters of cores sharing local (L1) multi-banked memory are widely used as key building blocks in large-scale architectures. When the cluster's core count increases, a flat all-to-all interconnect between cores and L1 memory banks becomes a physical implementation bottleneck, and hierarchical network topologies are required. However, hierarchical, multi-level intra-cluster networks are subject to internal contention, which may lead to significant performance degradation, especially for SIMD or vector cores, as their memory access is bursty. We present the TCDM Burst Access architecture, a software-transparent burst transaction support to improve bandwidth utilization in clusters with many vector cores tightly coupled to a multi-banked L1 data memory. In our solution, a Burst Manager dispatches burst requests to L1 memory banks, and multiple 32b words from burst responses are retired in parallel on channels with parametric data-width. We validate our design on a RISC-V Vector (RVV) many-core cluster, evaluating the benefits on different core counts. With minimal logic area overhead (less than 8%), we improve the bandwidth of 16-, 256-, and 1024-Floating Point Unit (FPU) baseline clusters without Tightly Coupled Data Memory (TCDM) Burst Access by 118%, 226%, and 77%, respectively. Reaching up to 80% of the cores-memory peak bandwidth, our design demonstrates ultra-high bandwidth utilization and enables efficient performance scaling. Implemented in a 12-nm FinFET technology node, compared to the serialized access baseline, our solution achieves up to 1.9x energy efficiency and 2.76x performance in real-world kernel benchmarks.
14:30 CEST TS27.7 SEDG: STITCH-COMPATIBLE END-TO-END LAYOUT DECOMPOSITION BASED ON GRAPH NEURAL NETWORK
Speaker:
Yifan Guo, Shanghai Jiao Tong university, CN
Authors:
Yifan Guo1, Jiawei Chen1, Yexin Li1, Yunxiang Zhang1, Qing Zhang1, Yuhang Zhang2 and Yongfu Li1
1Shanghai Jiao Tong University, CN; 2East China Normal University, CN
Abstract
Advanced semiconductor lithography faces significant challenges as feature sizes continue to shrink, necessitating effective Multiple Patterning Layout Decomposition (MPLD) algorithms. Existing MPLD algorithms have limited efficiency or cannot support stitch insertion to achieve finer-grained optimal decomposition. This paper introduces an end-to-end GNN-based framework that not only achieves high-quality solutions quickly but also applies to layouts with stitches. Our framework treats layouts as heterogeneous graphs and performs inference through a message-passing mechanism. We deliver ultra-competitive, near-optimal solutions that are 10× faster than the exact algorithm (e.g., integer linear programming) and 3× faster than approximate algorithms (e.g., exact-cover, semi-definite programming).
14:35 CEST TS27.8 MULTISCALE FEATURE ATTENTION AND TRANSFORMER BASED CONGESTION PREDICTION FOR ROUTABILITY-DRIVEN FPGA MACRO PLACEMENT
Speaker:
Hao Gu, Southeast University, CN
Authors:
Hao Gu1, Xinglin Zheng1, Youwen Wang1, Keyu Peng1, Ziran Zhu2 and Yang Jun3
1Southeast University, CN; 2School of Integrated Circuits, Southeast University, CN; 3,
Abstract
As routability has emerged as a critical task in modern field-programmable gate array (FPGA) physical design, it is desirable to develop an effective congestion prediction model during the placement stage. Given that the interconnection congestion level is a critical metric for measuring the routability of FPGA placement, we utilize that level as the model training label. In this paper, we propose a multiscale feature attention (MFA) and transformer based congestion prediction model to extract placement features and strengthen their association with congested areas for effective FPGA macro placement. A convolutional neural network (CNN) component is first designed to extract multiscale features from grid-based placement. Then, a well-designed MFA block is proposed that utilizes the dual attention mechanism on both spatial and channel dimensions to enhance the representation of each multiscale feature. By incorporating MFA blocks and CNN's output at each skip connection layer, our model substantially enhances its capability to learn features and recover more precise congestion level maps. Furthermore, multiple transformer layers that employ dynamic attention mechanisms are utilized to extract global information, which can significantly improve the difference between various congestion levels and enhance the ability to identify these levels. Based on the ten most congested and challenging benchmarks from the MLCAD 2023 FPGA macro placement contest, experimental results show that our model outperforms existing congestion prediction models. Furthermore, our model can achieve the best routability and score among the contest winners when integrated into the macro placer based on DREAMPlaceFPGA.
14:40 CEST TS27.9 AN EFFECTIVE AND EFFICIENT CROSS-LINK INSERTION FOR NON-TREE CLOCK NETWORK SYNTHESIS
Speaker:
Mengshi Gong, Southwest University of Science and Technology, CN
Authors:
Jinghao Ding1, Jiazhi Wen1, Hao Tang1, Zhaoqi Fu1, Mengshi Gong1, Yuanrui Qi1, Wenxin Yu1 and Jinjia Zhou2
1Southwest University of Science and Technology, CN; 2Hosei University, JP
Abstract
Clock skew introduces a significant challenge to overall system performance. Existing non-tree solutions like cross-link insertion often come with limitations, such as excessive resource and power consumption. In this work, we propose a cross-link insertion algorithm that effectively reduces clock skew with minimal power overhead and prioritizes delay optimization on the paths with high sensitivity to skew. The experimental results from the ISPD 2010 benchmarks show a 17% reduction in the mean of clock skew, a 45% decrease in the standard deviation of clock skew, and 13% lower power consumption versus the advanced non-tree solutions in the literature.

US01 Unplugged session

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 15:30 CEST


W03 3rd Workshop on Nano Security: From Nano-Electronics to Secure Systems

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 14:00 CEST - 18:00 CEST


LBR02 Late Breaking Results

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST


MPP03 Multi-Partner Projects

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST MPP03.1 MULTI-PARTNER PROJECT: ORCHESTRATING DEPLOYMENT AND REAL-TIME MONITORING - NEPHELE MULTI-CLOUD ECOSYSTEM
Speaker:
Manolis Katsaragakis, National TU Athens, GR
Authors:
Manolis Katsaragakis1, Orfeas Filippopoulos1, Christos Sad2, Dimosthenis Masouros1, Dimitrios Spatharakis1, Ioannis Dimolitsas1, Nikos Filinis1, Anastasios Zafeiropoulos1, Kostas Siozios3, Dimitrios Soudris1 and Symeon Papavassiliou1
1National TU Athens, GR; 2Department of Physics, Aristotle University of Thessaloniki, GR; 3Aristotle University of Thessaloniki, GR
Abstract
The rapid growth of Internet of Things (IoT) devices and emerging technologies, along with the growing demands of edge-deployed applications, has led to a complex paradigm where computation often shifts dynamically across the IoT-Edge-Cloud continuum. The NEPHELE project addresses these complexities by enabling seamless orchestration across a diverse spectrum of computing resources, spanning multi-cloud environments to the far-Edge. In this paper, we present NEPHELE's multi-cloud infrastructure, built to overcome key orchestration challenges within cloud and edge environments. We discuss the core components and architectural decisions, focusing on multi-cluster resource orchestration mechanisms, integrated monitoring for local and multi-cloud systems, inter- and intra-cluster scaling, and networking capabilities. Experimental results demonstrate the efficiency of our infrastructure, highlighting overhead management in service deployment, migration, networking, and scaling scenarios, thus demonstrating NEPHELE's robustness in handling distributed applications across heterogeneous environments.
16:35 CEST MPP03.2 MULTI-PARTNER PROJECT: CYBERSECDOME - FRAMEWORK FOR SECURE, COLLABORATIVE, AND PRIVACY-AWARE INCIDENT HANDLING FOR DIGITAL INFRASTRUCTURE
Speaker:
Mohammad Hamad, TU Munich, DE
Authors:
Mohammad Hamad1, Michael Kuehr1, Haralambos Mouratidis2, Eleni-Maria Kalogeraki2, Christos Gizelis3, Dimitris Papanikas3, Athanasios Bountioukos-Spinaris4, Charilaos Skandylas5, Evangelos Raptis6, Andreas Alexopoulos6, Grigorios Chrysos7, Mina Marmpena8, Sevasti Politi8, Konstantinos Lieros8, Papagiannopoulos Nikolaos9, Iordanis Xanthopoulos10, Spyros Papastergiou11, Sotiris Ioannidis7, Mikael Asplund12, Marc-Oliver Pahl13 and Sebastian Steinhorst1
1TU Munich, DE; 2Security Labs Consulting, IE; 3Hellenic Telecommunications Organisation, GR; 4CyberAlytics Ltd., CY; 5Linköping University, SE; 6Aegis IT Research, DE; 7TU Crete, GR; 8Information Technology for Market Leadership, GR; 9Athens International Airport S.A., GR; 10Sphynx, SZ; 11MAGGIOLI S.P.A., IT; 12Linköping University, SE; 13IMT Atlantique, FR
Abstract
Digital infrastructure is vital for the economy, democracy, and everyday life, yet it is becoming increasingly vulnerable to strategic cyber-attacks. These attacks can lead to significant digital disruptions, resulting in widespread service outages, financial losses, and a decline in public trust. Ensuring resilience is difficult due to the infrastructure's complexity, the large volume of data involved, and the growing need for quick, coordinated responses. In the EU Horizon project CyberSecDome, we propose a multi-layered framework that provides AI-driven solutions for incident detection and prediction, automated testing, risk assessment, and rapid incident response, supporting continuity amid complex, large-scale cyber threats. Additionally, CyberSecDome introduces a virtual reality interface to enhance AI model explainability and provide real-time contextual awareness of ongoing attacks and defense mechanisms. It also enables privacy-aware model sharing across AI systems, fostering secure collaboration among different systems.
16:40 CEST MPP03.3 MULTI-PARTNER PROJECT: ARCHITECTURES AND DESIGN METHODOLOGIES TO ACCELERATE AI WORKLOADS - THE ICSC FLAGSHIP 2 PROJECT
Speaker:
Cristina Silvano, Politecnico di Milano, IT
Authors:
Cristina Silvano1, Fabrizio Ferrandi1, Serena Curzel1, Daniele Ielmini2, Stefania Perri3, Fanny Spagnolo4, Pasquale Corsonello5, Sebastiano Schifano6, Cristian Zambelli7, Angelo Garofalo8, Luca Benini9 and Francesco Conti10
1Politecnico di Milano, IT; 2Politecnico di Milano, IT; 3University of Calabria - DIMEG, IT; 4DIMES, University of Calabria, IT; 5University of Calabria, IT; 6University of Ferrara, IT; 7University of Ferrara, IT; 8University of Bologna, ETH Zurich, IT; 9ETH Zurich, CH | Università di Bologna, IT; 10Università di Bologna, IT
Abstract
Recent pre-exascale and exascale supercomputers have driven the development of increasingly sophisticated AI models for diverse applications, including image recognition and classification, natural language processing, and generative AI. These applications require specialized hardware accelerators, to handle the heavy computational demands of AI algorithms in an energy-efficient manner. Today, AI accelerators are deployed across various systems, from low-power edge devices to large-scale servers, high-performance computing (HPC) infrastructures, and data centers. The primary objective of the ICSC Flagship 2 project, discussed in this paper, is to develop heterogeneous hardware platforms optimized to accelerate HPC and big data applications. Specifically, this paper provides an overview of the key challenges addressed and the achievements realized at the current intermediate stage of the ICSC Flagship 2 project focused on architectures, technologies, and design methodologies to design efficient hardware accelerators for AI workloads, such as deep learning (DL) and transformer models.
16:45 CEST MPP03.4 MULTI-PARTNER PROJECT: A DATA SPACES ARCHITECTURE FOR ENHANCING GREEN AI SERVICES (GREEN.DAT.AI)
Speaker:
Ioannis Chrysakis, Netcompany-Intrasoft SA, LU
Authors:
Ioannis Chrysakis1, Evangelos Agorogiannis1, Nikoleta Tsampanaki1, Michalis Vourtzoumis1, Eva Chondrodima2, Yannis Theodoridis2, Domen Mongus3, Ben Capper4, Martin Wagner5, Aris Sotiropoulos6, Fábio Coelho7, Claudia Brito8, Panos Protopapas9, Despina Brasinika9, Ioanna Fergadiotou9 and Christos Doulkeridis2
1Netcompany-Intrasoft SA, LU; 2University of Piraeus, GR; 3University of Maribor, SI; 4Red Hat, IE; 5Eviden, ES; 6AEGIS IT Research, DE; 7INESC TEC & Universidade do Minho, PT; 8INESC TEC, PT; 9Inlecom Innovation, GR
Abstract
The concept of data spaces has emerged as a structured, scalable solution to streamline and harmonize data sharing across established ecosystems. Simultaneously, the rise of AI services enhances the extraction of predictive insights, operational efficiency, and decision-making. Despite the potential of combining these two advancements, integration remains challenging: data spaces technology is still developing, and AI services require further refinement in areas like ML workflow orchestration and energy-efficient ML algorithms. In this paper, we introduce an integrated architectural framework, developed under the Green.Dat.AI project, that unifies the strengths of data spaces and AI to enable efficient, collaborative data sharing across sectors. A practical application is illustrated through a smart farming use case, showcasing how AI services within a data space can advance sustainable agricultural innovation. Integrating data spaces with AI services thus maximizes the value of decentralized data while enhancing efficiency through a powerful combination of data and AI capabilities.
16:50 CEST MPP03.5 MULTI-PARTNER PROJECT: SAFE, SECURE AND DEPENDABLE MULTI-UAV SYSTEMS FOR SEARCH AND RESCUE OPERATIONS
Speaker:
Maria Michael, University of Cyprus, CY
Authors:
Panagiota Nikolaou1, Antonis Savva2, Ioannis Sorokos3, Koorosh Aslansefat4, Sondess Missaoui5, Akram Naveed3, Daniel Hillen3, Marc Lorenz3, Martin Walker4, Manos Papoutsakis6, Simos Gerasimou5, Panayiotis Kolios7, Yiannis Papadopoulos4, Jan Reich3, Sotiris Ioannidis6 and Maria Michael8
1University of Central Lancashire, CY; 2KIOS Research and Innovation Center of Excellence and the Department of Electrical and Computer Engineering, CY; 3Fraunhofer IESE, DE; 4University of Hull, GB; 5University of York, GB; 6Institute of Computer Science, Foundation for Research and Technology, Heraklion, GR; 7Dept. of Computer Science, KIOS Centre of Excellence, University of Cyprus, CY; 8KIOS Research and Innovation Center of Excellence and the Department of Electrical and Computer Engineering, CY
Abstract
Unmanned Aerial Vehicles (UAVs) have become essential in search and rescue operations, especially in disaster management scenarios. Their effective navigation and the integration of a plethora of sensors assist in efficient person detection, making them an essential technological tool for first responders. Multi-UAV systems extend these benefits by using coordinated strategies to cover large areas efficiently, reducing overall mission response time and enhancing its success. Despite these advantages, challenges remain in ensuring the safety, security, and dependability of (multi-)UAV missions. Navigation risks, potential cyber threats, and hardware-/software-related reliability issues can impact mission results. Additionally, UAVs are highly constrained devices with limited battery capacity, requiring the use of lightweight technologies. In this paper, we present part of the results of the SESAME project, an EU multi-partner project that aims to develop safe and secure multi-robot systems. In particular, we present some of the developed SESAME Executable Digital Dependability Identities (EDDI) technologies based on Markov models, statistical distance measures, and other advanced approaches for enhancing safety, security and dependability of the UAV platform and underlying models. These EDDI technologies are seamlessly integrated using the ConSerts framework in a multi-UAV platform and tested using search and rescue scenarios. The results demonstrate significant improvements in multi-UAV safety, with an availability rate of 91% and a search and rescue algorithmic accuracy of 99.8%. Additionally, the system achieves precise detection of spoofing attacks, using collaborative localization as a mitigation technique to guide the UAV to a safe landing, even in the absence of GPS signals.
16:55 CEST MPP03.6 MULTI-PARTNER PROJECT: KEY ENABLING TECHNOLOGIES FOR COGNITIVE COMPUTING CONTINUUM - MYRTUS PROJECT PERSPECTIVE
Speaker:
Francesca Palumbo, Università degli Studi di Cagliari, IT
Authors:
Francesca Palumbo1, Francesco Ratto2, Claudio Rubattu3, Maria Katiuscia Zedda4, Tiziana Fanni4, Veena Rao5, Bart Driessen6 and Jeronimo Castrillon7
1University of Cagliari, IT; 2Università degli Studi di Cagliari, IT; 3University of Sassari, IT; 4Abinsula Srl, IT; 5HIRO microdatacenters, NL; 6TNO, NL; 7TU Dresden, DE
Abstract
The MYRTUS Horizon Europe project embraces the principles of the EU CloudEdgeIoT Initiative, integrating edge, fog, and cloud in a continuum of computing resources. MYRTUS intends to deliver abstractions, cognitive orchestration mechanisms, and a whole design environment to build and operate collaborative, distributed, heterogeneous systems. The goal is to provide high performance and play a crucial role in enabling energy efficiency and trustworthiness in today's systems.
17:00 CEST MPP03.7 MULTI-PARTNER PROJECT: BIM-POWERED ENVIRONMENTAL DATA AGENT FOR MORE RESILIENT AND TRUSTWORTHY DATA CENTERS
Speaker:
Oğuzhan Herkiloğlu, BİTNET Bilişim Hizmetleri, Ltd., TR
Authors:
Oğuzhan Herkiloğlu1, Ali Atalay2, İbrahim Arif3, Salih Ergün4 and Alper Kanak5
1Bitnet Bilişim Hizmetleri Ltd. Sti., TR; 2AI4SEC OÖ, EE; 3Ergünler R&D Ltd. Co., TR; 4Ergtech SP.Z.O.O, PL; 5Ergünler R&D Co. Ltd., TR
Abstract
This paper introduces an agent-based approach that semantically integrates the Building Information Model (BIM), Geographical Information System (GIS), and the Environmental Data Agent (EDA)-based optimization interface between Information and Operational Technology (IT/OT) for more trusted and resilient data centers. Using the cybersecurity-aware BIM-GIS-IoT data model facilitates the exchange of requirements and forecasts to optimize energy use, environmental impact, availability, and costs in data centers. At the core of this solution, the EDA securely mediates data exchange between IT and OT, translating IT resource consumption into energy metrics for effective optimization.
17:01 CEST MPP03.8 MULTI-PARTNER PROJECT: RESILIENT TSN NETWORKS (RESTSN)
Speaker:
Rafik Henia, cortAIx/Labs, Thales, FR
Authors:
Rafik Henia1 and Marc Boyer2
1Thales Research & Technology, FR; 2ONERA, FR
Abstract
TSN (Time-Sensitive Networking), an extension of Ethernet technology standardized by IEEE, appears to be a promising technology to unify traditional networks in modern vehicle systems. It offers several advantages, including a much higher bandwidth and enhanced determinism, efficiency, scalability, and interoperability. One notable advantage of TSN over traditional networks lies in its capability to support data traffic of different shapes (e.g., synchronous control/command, asynchronous video…) within the same network infrastructure. This means that TSN can accommodate a diverse range of data traffic with varying levels of importance or urgency, e.g., in aircraft from critical flight control systems to less time-sensitive passenger entertainment systems, all within a single network framework. Its deployment in vehicles would therefore substantially improve communication systems and reduce operational costs. Transmitting critical and non-critical data over the same physical TSN network infrastructure will require maintaining reliable communication to ensure essential functionalities, such as braking systems or flight control. However, in the event of a network failure (due to hardware breakdown, software malfunction, or even a cyberattack), some of the communications are inevitably lost, potentially compromising the system's overall integrity. This type of issue can have severe consequences on the overall system reliability. The ResTSN project's objective is to enhance the resilience of TSN by allowing the network, in case of failure, to automatically and dynamically reconfigure itself, ensuring that its most critical functions continue to operate, even with reduced resources. Reconfiguring the TSN network allows isolating the malfunctioning components, thus preventing potential cascading failures. Additionally, reconfiguration ensures that the TSN network can be restored once the issue is resolved.

TS28 Test and Verification for Dependability

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS28.1 POLYNOMIAL FORMAL VERIFICATION OF SEQUENTIAL CIRCUITS USING WEIGHTED-AIGS
Speaker:
Mohamed Nadeem, University of Bremen, DE
Authors:
Mohamed Nadeem1, Chandan Jha1 and Rolf Drechsler2
1University of Bremen, DE; 2University of Bremen | DFKI, DE
Abstract
Ensuring the functional correctness of a digital system is achievable through formal verification. Despite the increased complexity of modern systems, formal verification still needs to be done in a reasonable time. Hence, Polynomial Formal Verification (PFV) techniques are being explored as they provide a guaranteed upper bound on the run time for verification. Recently, it was shown that combinational circuits characterized by a constant cutwidth can be verified in linear time using Answer Set Programming (ASP). However, most of the designs used in digital systems are sequential. Hence, in this paper, we propose a linear-time formal verification approach using ASP for sequential circuits with constant cutwidth. We achieve this by proposing a new data structure called Weighted-And Inverter Graph (W-AIG). Unlike existing formal verification methods, we prove that our approach can verify any sequential circuit with a constant cutwidth in linear time. Finally, we also implement our approach and experimentally show the results on a variety of sequential circuits like pipelined adders, serial adders, and shift registers to confirm our theoretical findings.
16:35 CEST TS28.2 WORD-LEVEL COUNTEREXAMPLE REDUCTION METHODS FOR HARDWARE VERIFICATION
Speaker:
Zhiyuan Yan, Microelectronics Thrust, The Hong Kong University of Science and Technology (Guangzhou), CN
Authors:
Zhiyuan Yan1 and Hongce Zhang2
1The Hong Kong University of Science and Technology (Guangzhou), CN; 2The Hong Kong University of Science and Technology (Guangzhou), CN
Abstract
Hardware verification is crucial to ensure the correctness of the logic design of digital circuits. The purpose of verification is to either find bugs or show their absence. Prior works mostly focus on the bug-finding process and have proposed a range of verification algorithms and techniques to reach a bug faster or to conclude with a proof of correctness. However, for a human verification engineer, it also matters how to better analyze the counterexample trace to understand the root cause of bugs. This kind of technique remains absent in word-level circuit analysis. In this paper, we investigate the counterexample reduction method. Given the existing techniques for the bit-level circuit model, we first extend current semantic analysis methods to word-level counterexample reduction and then develop a more efficient word-level structural analysis approach. We compare the effectiveness and overhead of these methods on hardware model-checking problems and show the usefulness of such analysis in applications including pivot input analysis, word-level model-checking and counterexample-guided abstraction refinement.
16:40 CEST TS28.3 ACCURATE AND EXTENSIBLE SYMBOLIC EXECUTION OF BINARY CODE BASED ON FORMAL ISA SEMANTICS
Speaker:
Sören Tempel, TU Braunschweig, DE
Authors:
Sören Tempel1, Tobias Brandt2, Christoph Lüth3, Christian Dietrich1 and Rolf Drechsler3
1TU Braunschweig, DE; 2Independent, DE; 3University of Bremen | DFKI, DE
Abstract
Symbolic execution is an SMT-based software verification and testing technique. Symbolic execution requires tracking performed computations during software simulation to reason about branches in the software under test. The prevailing approach to symbolic execution of binary code tracks computations by transforming the code under test into an architecture-independent IR and then symbolically executing this IR. However, the resulting IR must be semantically equivalent to the binary code, making this process complex and error-prone. The semantics of the binary code are specified by the targeted ISA, commonly given in natural language and requiring a manual implementation of the transformation to an IR. In recent years, the use of formal languages to describe ISA semantics in a machine-readable way has gained increased popularity. We investigate the utilization of such formal semantics for symbolic execution of binary code, achieving an accurate representation of instruction semantics. We present a prototype for the RISC-V ISA and conduct a case study to demonstrate that it can be easily extended to additional instructions. Furthermore, we perform an experimental comparison with prior work, which resulted in the discovery of five previously unknown bugs in the ISA implementation of the popular IR-based symbolic executor angr.
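The core idea of executing instruction semantics symbolically can be pictured with Z3 (the z3-solver Python package): instruction semantics become plain functions over a symbolic register file, and branch conditions become path constraints. The two toy "instructions", register names, and program below are invented for illustration and bear no relation to the paper's formal RISC-V model or to angr.

```python
from z3 import BitVec, BitVecVal, Solver, sat

def exec_addi(state, rd, rs, imm):
    """Toy ADDI semantics: rd = rs + imm on a symbolic 32-bit register file."""
    state = dict(state)
    state[rd] = state[rs] + BitVecVal(imm, 32)
    return state

def take_beq(solver, state, rs1, rs2):
    """Explore the taken path of a toy BEQ: add the branch condition as a path constraint."""
    cond = state[rs1] == state[rs2]
    solver.add(cond)
    return cond

# Fully symbolic initial register file.
state = {f"x{i}": BitVec(f"x{i}", 32) for i in range(4)}
s = Solver()

state = exec_addi(state, "x1", "x0", 5)    # x1 = x0 + 5
state = exec_addi(state, "x2", "x1", -5)   # x2 = x1 - 5
take_beq(s, state, "x2", "x0")             # taken path of beq x2, x0

print(s.check())   # sat: the taken path is feasible (here, for any initial x0)
print(s.model())
```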
16:45 CEST TS28.4 EFFICIENT SAT-BASED BOUNDED MODEL CHECKING OF EVOLVING SYSTEMS
Speaker:
Sophie Andrews, Stanford University, US
Authors:
Sophie Andrews, Matthew Sotoudeh and Clark Barrett, Stanford University, US
Abstract
SAT-based verification is a common technique used by industry practitioners to find bugs in computer systems. However, these systems are rarely designed in a single step: instead, designers repeatedly make small modifications, reverifying after each change. With current tools, this reverification step takes as long as a full, from-scratch verification, even if the design has only been modified slightly. We propose a novel SAT-based verification technique that performs significantly better than the naive approach in the setting of evolving systems. The key idea is to reuse information learned during the verification of earlier versions of the system to speed up the verification of later versions. We instantiate our technique in a bounded model checking tool for SystemVerilog code and apply it to a new benchmark set based on real edit history for a set of open source RISC-V cores. This new benchmark set is now publicly available for further research on verification of evolving systems. Our tool, PrediCore, significantly improves the time required to verify properties on later versions of the cores compared to the current state-of-the-art, verify-from-scratch approach.
16:50 CEST TS28.5 HIGH-THROUGHPUT SAT SAMPLING
Speaker:
Arash Ardakani, University of California, Berkeley, US
Authors:
Arash Ardakani1, Minwoo Kang1, Kevin He1, Qijing Huang2 and John Wawrzynek1
1University of California, Berkeley, US; 2NVIDIA Corp., US
Abstract
In this work, we present a novel technique for GPU-accelerated Boolean satisfiability (SAT) sampling. Unlike conventional sampling algorithms that directly operate on conjunctive normal form (CNF), our method transforms the logical constraints of SAT problems by factoring their CNF representations into simplified multi-level, multi-output Boolean functions. It then leverages gradient-based optimization to guide the search for a diverse set of valid solutions. Our method operates directly on the circuit structure of refactored SAT instances, reinterpreting the SAT problem as a supervised multi-output regression task. This differentiable technique enables independent bit-wise operations on each tensor element, allowing parallel execution of learning processes. As a result, we achieve GPU-accelerated sampling with significant runtime improvements ranging from 33.6x to 523.6x over state-of-the-art heuristic samplers. We demonstrate the superior performance of our sampling method through an extensive evaluation on 60 instances from a public domain benchmark suite utilized in previous studies.
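One way to picture gradient-guided SAT sampling is the smooth CNF relaxation below: variables become values in [0, 1], each clause gets a differentiable satisfaction score, and gradient descent from different random starts yields diverse candidate assignments. The tiny formula, the product-form relaxation, and the hyperparameters are illustrative assumptions; the paper's method instead optimizes over refactored multi-level circuits on a GPU.

```python
import numpy as np

# CNF as a list of clauses; each literal is (variable index, is_positive).
# Hypothetical instance: (x0 or ~x1) and (x1 or x2) and (~x0 or x2)
CNF = [[(0, True), (1, False)], [(1, True), (2, True)], [(0, False), (2, True)]]
NUM_VARS = 3

def relaxed_loss_and_grad(x, eps=1e-3):
    """Smooth relaxation: clause_sat = 1 - prod(1 - literal); loss = -sum log(clause_sat)."""
    grad = np.zeros_like(x)
    loss = 0.0
    for clause in CNF:
        lits = [x[v] if pos else 1.0 - x[v] for v, pos in clause]
        one_minus = [1.0 - l for l in lits]
        sat = 1.0 - np.prod(one_minus)
        loss -= np.log(sat + eps)
        for j, (v, pos) in enumerate(clause):
            partial = np.prod([one_minus[k] for k in range(len(clause)) if k != j])
            dsat_dx = partial * (1.0 if pos else -1.0)
            grad[v] -= dsat_dx / (sat + eps)
    return loss, grad

def sample(seed, steps=200, lr=0.3):
    """Gradient descent from a random start; rounding yields one candidate assignment."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.2, 0.8, NUM_VARS)
    for _ in range(steps):
        _, g = relaxed_loss_and_grad(x)
        x = np.clip(x - lr * g, 0.0, 1.0)
    return tuple(int(round(v)) for v in x)

# Different random starts play the role of parallel GPU threads exploring diverse candidates.
print({sample(s) for s in range(8)})
```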
16:55 CEST TS28.6 SMT-BASED REPAIRING REAL-TIME TASK SPECIFICATIONS
Speaker:
Anand Yeolekar, TCS Research, IN
Authors:
Anand Yeolekar1, Ravindra Metta1 and Samarjit Chakraborty2
1TCS, IN; 2UNC Chapel Hill, US
Abstract
When addressing timing issues in real-time systems, approaches for systematic timing debugging and repair have been missing due to (i) Lack of available feedback: most timing analysis techniques, being closed-form analytical techniques, are unable to provide root cause information when a timing property is violated, which is critical for identifying an appropriate repair, and (ii) Pessimism in the analysis: existing schedulability analysis techniques tend to make worst case assumptions in the presence of non-determinism introduced by real-world factors such as release jitter, or sporadic tasks. To address this gap, we propose an SMT encoding of task runs for exact debugging of timing violations, and a procedure to iteratively repair a given task specification. We demonstrate the utility of this procedure by repairing example task sets scheduled under global non-preemptive earliest-deadline-first scheduling, a common choice for many safety-critical systems.
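As a much-simplified picture of encoding task runs in SMT, the sketch below (using the z3-solver Python package) constrains non-preemptive, single-core job start times and asks the solver to minimize the deadline relaxation needed to make an infeasible job set schedulable. The job set, the per-job slack variables, and the single-core restriction are illustrative assumptions, not the paper's global-EDF encoding or repair procedure.

```python
from z3 import Optimize, Real, Or, sat

jobs = [  # (name, release, wcet, deadline) -- illustrative numbers only
    ("t1", 0.0, 3.0, 5.0),
    ("t2", 1.0, 2.0, 6.0),
    ("t3", 2.0, 3.0, 7.0),
]

opt = Optimize()
start = {n: Real(f"s_{n}") for n, *_ in jobs}
slack = {n: Real(f"d_{n}") for n, *_ in jobs}   # per-job deadline relaxation (the "repair")

for name, rel, wcet, dl in jobs:
    opt.add(start[name] >= rel)                 # a job cannot start before its release
    opt.add(slack[name] >= 0)
    opt.add(start[name] + wcet <= dl + slack[name])  # finish by the (possibly relaxed) deadline

# Non-preemptive mutual exclusion on one core: jobs must not overlap in time.
for i in range(len(jobs)):
    for j in range(i + 1, len(jobs)):
        a, _, ca, _ = jobs[i]
        b, _, cb, _ = jobs[j]
        opt.add(Or(start[a] + ca <= start[b], start[b] + cb <= start[a]))

# Minimize the total deadline relaxation needed to make the job set feasible.
opt.minimize(sum(slack.values()))
if opt.check() == sat:
    m = opt.model()
    for n in start:
        print(n, "start =", m[start[n]], "extra deadline =", m[slack[n]])
```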
17:00 CEST TS28.7 HACHIFI: A LIGHTWEIGHT SOC ARCHITECTURE-INDEPENDENT FAULT-INJECTION FRAMEWORK FOR SEU IMPACT EVALUATION
Speaker:
Masanori Hashimoto, Kyoto University, JP
Authors:
Quan Cheng1, Wang Liao2, Ruilin Zhang1, Hao Yu3, Longyang Lin3 and Masanori Hashimoto1
1Kyoto University, JP; 2Kochi University of Technology, JP; 3Southern University of Science and Technology, CN
Abstract
Single-Event Upsets (SEUs), triggered by energetic particles, manifest as unexpected bit-flips in memory cells or registers, potentially causing significant anomalies in electronic devices. Driven by the needs of safety-critical applications, it is crucial to evaluate the reliability of these electronic devices before they are deployed. However, traditional reliability analysis techniques, such as irradiation experiments, are costly, while fault injection (FI) simulations often fail to provide full coverage and have limited effectiveness and accuracy. To address these issues, we introduce HachiFI, a lightweight, architecture-independent framework that automates fault injection with 100% coverage via memory and scan-chain accesses and simulates the behavior of SEUs based on specific cross-sections. HachiFI supports configurable fault injection patterns for both system-level and module-level reliability analysis. Using HachiFI, we demonstrate a low hardware overhead (<2%) and a high match (R^2=0.984) between FI and irradiation experiments, verified on a 22nm edge-AI chip.
17:05 CEST TS28.8 ACCELERATING CELL-AWARE MODEL GENERATION FOR SEQUENTIAL CELLS USING GRAPH THEORY
Speaker:
Gianmarco Mongelli, LIRMM, FR
Authors:
Gianmarco Mongelli1, Eric Faehn2, Dylan Robins2, Patrick Girard3 and Arnaud Virazel3
1LIRMM and STMicroelectronics Crolles, FR; 2STMicroelectronics, FR; 3LIRMM, FR
Abstract
The Cell-Aware (CA) methodology has become essential to detect and diagnose manufacturing intra-cell defects in modern semiconductor technologies. It characterizes standard cells by creating a defect-detection matrix, which serves as a reference that maps stimuli to the specific defects they can detect. Its limitation is that the CA approach needs a number of time-consuming analog simulations to create the matrix. In [1], a graph-based methodology called Transistor Undetectable Defect eLiminator (TrUnDeL), able to reduce the number of simulations to perform, was presented. TrUnDeL can identify undetectable stimulus/defect pairs that are then excluded from the analog simulations. However, its use is limited to combinational cells and does not offer any guidance on handling sequential cells, which are usually the most complex cells. In this paper, we present a new version of TrUnDeL that supports the analysis of sequential cells. Experiments conducted on sequential cells from two industrial standard-cell libraries demonstrate that the CA generation time is reduced by 30% without compromising accuracy.
17:10 CEST TS28.9 AN EFFICIENT PARALLEL FAULT SIMULATOR FOR FUNCTIONAL PATTERNS ON MULTI-CORE SYSTEMS
Speaker:
Xiaoze Lin, Institute of Computing Technology, Chinese Academy of Sciences, CN
Authors:
Xiaoze Lin1, Liyang Lai2, Huawei Li3, Biwei Xie3 and Xingquan Li4
1State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, CN; 2Shantou University, CN; 3Institute of Computing Technology, Chinese Academy of Sciences, CN; 4Peng Cheng Laboratory, CN
Abstract
Fault simulation targeting functional patterns emerges as an essential mechanism within functional safety, crucial for validating the effectiveness of safety mechanisms. The acceleration of fault simulation for functional patterns is imperative for boosting the efficiency and adaptability of functional safety verification, presenting a significant yet unresolved challenge. In this paper, we propose an efficient fault simulator for functional patterns that combines three techniques: fault filtering, fault grouping, and CPU-based parallelism. The integration of these three techniques, tailored to the characteristics of functional patterns, reduces the runtime of fault simulation from different perspectives. The experimental results show that on a 48-core system, an average 79x speedup can be achieved by our parallel fault simulator against a commercial tool.
17:15 CEST TS28.10 SPATIAL MODELING WITH AUTOMATED MACHINE LEARNING AND GAUSSIAN PROCESS REGRESSION TECHNIQUES FOR IMPUTING WAFER ACCEPTANCE TEST DATA
Speaker:
Ming-Chun Wei, National Cheng Kung University, TW
Authors:
Ming-Chun Wei, Hsun-Ping Hsieh and Chun-Wei Shen, National Cheng Kung University, TW
Abstract
The Wafer Acceptance Test (WAT) is a significant quality control measurement in the semiconductor industry. However, because the WAT process can be time-consuming and expensive, sampling tests are commonly employed during production. This makes root-cause tracing impossible when abnormal products have not been tested. Therefore, in our study, we focus on establishing a reliable method to estimate WAT results for non-tested shots, covering both intra- and inter-wafer prediction. Notably, we are the first to combine the use of Chip Probing data with WAT to improve the predictions. Our proposed method first extracts valuable features from Chip Probing test results by using an Automated Machine Learning technique. We then employ Gaussian Process Regression to capture the spatio-temporal correlation. Finally, we adopt a linear regression model to ensemble the two components, yielding the SMART-WAT model that effectively estimates the wafer acceptance test data. Our method has been tested on a real-world dataset from the semiconductor manufacturing industry. The prediction results of four key WAT parameters indicate that our proposed model outperforms the state-of-the-art methods in both intra- and inter-wafer prediction.
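To illustrate the regression component only, the sketch below fits a Gaussian Process over wafer (x, y) shot coordinates and imputes a parameter at untested shots (scikit-learn; the synthetic data, kernel choice, and the omission of the Chip Probing features and AutoML stage are simplifying assumptions).

    # Illustrative only: impute a WAT parameter at untested shots from tested ones.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    rng = np.random.default_rng(0)
    xy_tested = rng.uniform(-1, 1, size=(40, 2))                      # sampled shot positions
    wat = 1.0 + 0.3 * xy_tested[:, 0] - 0.2 * xy_tested[:, 1] ** 2    # synthetic WAT values

    kernel = 1.0 * RBF(length_scale=0.5) + WhiteKernel(noise_level=1e-3)
    gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(xy_tested, wat)

    xy_untested = np.array([[0.2, -0.4], [-0.7, 0.1]])                # shots skipped by sampling
    mean, std = gpr.predict(xy_untested, return_std=True)             # imputed value + uncertainty
    print(mean, std)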
17:20 CEST TS28.11 ON THE IMPACT OF WARPAGE ON BEOL GEOMETRY AND PATH DELAYS IN FAN-OUT WAFER-LEVEL PACKAGING
Speaker:
Dhruv Thapar, Arizona State University, US
Authors:
Dhruv Thapar1, Arjun Chaudhuri1, Christopher Bailey1, Ravi Mahajan2 and Krishnendu Chakrabarty1
1Arizona State University, US; 2Intel Corporation, US
Abstract
Warpage is a major concern in fan-out wafer-level packaging (FOWLP) due to the complex thermal processing steps involved in manufacturing. These steps include curing, electroplating, and deposition, which induce residual stresses through differential thermal expansion and contraction of materials. This effect is further amplified by mismatches in the coefficients of thermal expansion (CTE) between different materials. In particular, high-density interconnects in the back-end of line (BEOL), redistribution layers (RDLs), and through-mold vias (TMVs) are susceptible to warpage-induced stress, strain, and deformation. This work conducts structural simulations to analyze warpage in the BEOL stack induced by FOWLP. Our results indicate that the impact of warpage is non-uniform across the entire BEOL geometry of a die; hence, it affects different metal layers, and different coordinates within one metal layer, to different degrees. We leverage this warpage analysis to calculate parasitics and evaluate the resulting changes in path delays.
17:21 CEST TS28.12 MODELING AND ANALYSIS TECHNIQUE FOR THE FORMAL VERIFICATION OF SYSTEM-ON-CHIP ADDRESS MAPS
Speaker:
Niels Mook, NXP, NL
Authors:
Niels Mook1, Erwin de Kock1, Bas Arts1, Soham Chakraborty2 and Arie van Deursen2
1NXP Semiconductors, NL; 2TU Delft, NL
Abstract
This paper proposes a modeling and analysis technique to verify SoC address maps. The approach involves (i) modeling the specification and implementation address map using a unified graph model, and (ii) analysis of equivalence in terms of address maps between two such models. Using a state-of-the-art mid-size SoC design, we demonstrate the proposed solution is able to analyze and verify address maps of complex SoC designs and to identify the causes of discrepancies.
17:22 CEST TS28.13 FREDDY: MODULAR AND EFFICIENT FRAMEWORK TO ENGINEER DECISION DIAGRAMS YOURSELF
Speaker:
Rune Krauss, DFKI, DE
Authors:
Rune Krauss1, Jan Zielasko1 and Rolf Drechsler2
1DFKI, DE; 2University of Bremen | DFKI, DE
Abstract
The hardware complexity in electronic devices used by today's society has increased significantly in recent decades due to technological progress. In order to cope with this complexity, data structures and algorithms in electronic design automation must be continuously improved. Decision Diagrams (DDs) are an important data structure in the design and analysis of circuits because they allow efficient algorithms for their manipulation. The practical relevance of DDs leads to an ongoing quest for appropriate software solutions that enable working with different DD types. Unfortunately, existing DD software libraries focus either on efficiency or on usability. The consequences are either disproportionately high effort for extensions or a considerable loss of performance. To tackle these issues, a modular and efficient Framework to Engineer Decision Diagrams Yourself (FrEDDY) is proposed in this paper. Various experiments demonstrate that no compromise with regard to performance has to be made when using FrEDDY. It is on par with or clearly more efficient than established DD libraries.
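For readers unfamiliar with DD internals, the minimal sketch below shows the canonicity mechanism that most DD libraries are built around, a unique table that shares structurally identical nodes; it illustrates the data structure only and is not FrEDDY's implementation.

    # Hash-consing of DD nodes: mk() returns an existing node for (var, low, high)
    # instead of duplicating it, keeping the diagram reduced and shared.
    class DDManager:
        FALSE, TRUE = 0, 1                      # terminal node ids

        def __init__(self):
            self.nodes = [None, None]           # placeholders for the two terminals
            self.unique = {}                    # (var, low, high) -> node id

        def mk(self, var, low, high):
            if low == high:                     # reduction rule: redundant test
                return low
            key = (var, low, high)
            if key not in self.unique:
                self.nodes.append(key)
                self.unique[key] = len(self.nodes) - 1
            return self.unique[key]

    m = DDManager()
    x1 = m.mk(1, m.FALSE, m.TRUE)               # the variable x1 as a BDD node
    same = m.mk(1, m.FALSE, m.TRUE)             # shared: no new node is created
    print(x1 == same, len(m.nodes))             # True 3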

TS29 Approximate Computing Solutions

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS29.1 EFFICIENT APPROXIMATE LOGIC SYNTHESIS WITH DUAL-PHASE ITERATIVE FRAMEWORK
Speaker:
Ruicheng Dai, Shanghai Jiao Tong University, CN
Authors:
Ruicheng Dai1, Xuan Wang1, Wenhui Liang1, Xiaolong Shen2, Menghui Xu2, Leibin Ni2, Gezi Li2 and Weikang Qian1
1Shanghai Jiao Tong University, CN; 2Huawei Technologies Co., Ltd., CN
Abstract
Approximate computing is an emerging paradigm to improve the energy efficiency for error-tolerant applications. Many iterative approximate logic synthesis (ALS) methods were proposed to automatically design approximate circuits. However, as the sizes of circuits grow, the runtime of ALS grows rapidly. Thus, a crucial challenge is to ensure circuit quality while improving the efficiency of ALS. This work proposes a dual-phase iterative framework to accelerate the iterative ALS flows. In the first phase, a comprehensive circuit analysis is performed to gather the necessary information, including the error information. In the second phase, minimal incremental computation is employed based on the information from the first phase. The experimental results show that the proposed method achieves an acceleration by up to 21.8× without loss of circuit quality compared to the state-of-the-art methods.
16:35 CEST TS29.2 EFFICIENT APPROXIMATE NEAREST NEIGHBOR SEARCH VIA DATA-ADAPTIVE PARAMETER ADJUSTMENT IN HIERARCHICAL NAVIGABLE SMALL GRAPHS
Speaker:
Huijun Jin, Yonsei University, KR
Authors:
Huijun Jin, Jieun Lee, Shengmin Piao, Sangmin Seo, Sein Kwon and Sanghyun Park, Yonsei University, KR
Abstract
Hierarchical Navigable Small World (HNSW) graphs are a state-of-the-art solution for approximate nearest neighbor search, widely applied in areas like recommendation systems, computer vision, and natural language processing. However, the effectiveness of the HNSW algorithm is constrained by its reliance on static parameter settings, which do not account for variations in data density and dimensionality across different datasets. This paper introduces Dynamic HNSW, an adaptive method that dynamically adjusts key parameters, such as M (the number of connections per node) and ef (the search depth), based on both the local data density and the dimensionality of the dataset. The proposed approach improves flexibility and efficiency, allowing the graph to adapt to diverse data characteristics. Experimental results across multiple datasets demonstrate that Dynamic HNSW significantly reduces graph build time by up to 33.11% and memory usage by up to 32.44%, while maintaining comparable recall, thereby outperforming the conventional HNSW in both scalability and efficiency.
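A minimal sketch of the data-adaptive idea, assuming local density is estimated from the mean distance to the k nearest neighbours and mapped linearly onto ranges for M and ef (numpy only; the mapping and ranges are illustrative, not the paper's policy):

    import numpy as np

    def adaptive_params(data, k=10, m_range=(8, 32), ef_range=(50, 200)):
        d = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        knn_dist = np.sort(d, axis=1)[:, :k].mean(axis=1)        # local density proxy
        dens = (knn_dist - knn_dist.min()) / (np.ptp(knn_dist) + 1e-12)
        M = np.round(m_range[0] + dens * (m_range[1] - m_range[0])).astype(int)
        ef = np.round(ef_range[0] + dens * (ef_range[1] - ef_range[0])).astype(int)
        return M, ef        # sparser regions receive more connections / deeper search

    M, ef = adaptive_params(np.random.default_rng(0).normal(size=(100, 16)))
    print(M[:5], ef[:5])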
16:40 CEST TS29.3 HAAN: A HOLISTIC APPROACH FOR ACCELERATING LAYER NORMALIZATION IN LARGE LANGUAGE MODELS
Speaker:
Sai Qian Zhang, New York University, US
Authors:
Tianfan Peng1, Tianhua Xia2, Jiajun Qin3 and Sai Qian Zhang4
1Tongji University, CN; 2Independent Researcher, US; 3Zhejiang University, CN; 4New York University, US
Abstract
Large language models (LLMs) have revolutionized natural language processing (NLP) tasks by achieving state-of-the-art performance across a range of benchmarks. Central to the success of these models is the integration of sophisticated architectural components aimed at improving training stability, convergence speed, and generalization capabilities. Among these components, normalization operations, such as layer normalization (LayerNorm), emerge as a pivotal technique, offering substantial benefits to the overall model performance. However, previous studies have indicated that normalization operations can substantially elevate processing latency and energy usage. In this work, we adopt the principles of algorithm and hardware co-design, introducing a holistic normalization accelerating method named HAAN. The evaluation results demonstrate that HAAN can achieve significantly better hardware performance compared to state-of-the-art solutions.
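For reference, the operation being accelerated, per-token normalization followed by a learned affine transform, can be written in a few lines of numpy; this is the textbook definition of LayerNorm, not HAAN's hardware mapping.

    import numpy as np

    def layer_norm(x, gamma, beta, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)     # mean over the hidden dimension
        var = x.var(axis=-1, keepdims=True)     # variance over the hidden dimension
        return gamma * (x - mu) / np.sqrt(var + eps) + beta

    x = np.random.default_rng(0).normal(size=(4, 768))           # 4 tokens, hidden size 768
    y = layer_norm(x, gamma=np.ones(768), beta=np.zeros(768))
    print(y.mean(axis=-1), y.std(axis=-1))                       # roughly 0 and 1 per token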
16:45 CEST TS29.4 MCTA: A MULTI-STAGE CO-OPTIMIZED TRANSFORMER ACCELERATOR WITH ENERGY-EFFICIENT DYNAMIC SPARSE OPTIMIZATION
Speaker:
Heng Liu, Harbin Institute of Technology, CN
Authors:
Heng Liu, Ming Han, Jin Wu, Ye Wang and Jian Dong, Harbin Institute of Technology, CN
Abstract
As Transformer-based models continue to enhance service quality across various domains, their intensive computational requirements are exacerbating the AI energy crisis. Traditional energy-efficient Transformer architectures primarily focus on optimizing the Attention stage due to its high algorithmic complexity (O(n^2)). However, linear layers can also be significant energy consumers, sometimes accounting for over 70% of total energy usage. Although existing approaches such as sparsity have improved the Attention stage, the optimization space within such linear layers is not fully exploited. In this paper, we introduce the multi-stage co-optimized Transformer accelerator (MCTA) for optimizing energy efficiency. Our approach independently enhances the Query-Key-Value generation, Attention, and Feed-forward Neural Network stages. It employs two novel techniques: Low-overhead Mask Generation (LMG) for dynamically identifying unimportant calculations with minimal energy costs, and Cascaded Mask Derivation (CMD) for streamlining the mask generation process through parallel processing. Experimental results show that MCTA achieves an average energy reduction of 1.48× with only a 1% accuracy loss compared to state-of-the-art accelerators. This work demonstrates the potential for significant energy savings in Transformer models without the need for retraining, paving the way for more sustainable AI applications.
16:50 CEST TS29.5 CIRCUITS IN A BOX: COMPUTING HIGH-DIMENSIONAL PERFORMANCE SPACES FOR ANALOG INTEGRATED CIRCUITS
Speaker:
Juergen Kampe, Ernst-Abbe-Hochschule Jena, DE
Authors:
Benedikt Ohse, Jürgen Kampe and Christopher Schneider, Ernst-Abbe-Hochschule Jena, DE
Abstract
Performance spaces contain information about all combinations of attainable performance parameters of analog integrated circuits. Their exploration allows designers to evaluate given circuits without considering implementation details, making them a valuable tool to support the design process. The computation of performance spaces---even for a small number of considered parameters---is time-consuming because it requires solving multi-objective, non-convex optimization problems that involve costly circuit simulations. We present a numerical method for efficiently approximating high-dimensional performance spaces, which is based on the box-coverage method known from Pareto optimization. The resulting implementation not only outperforms state-of-the-art solvers based on the well-known Normal-Boundary Intersection method in terms of computational complexity, but also offers several advantages, such as a practical stopping criterion and the possibility of warm starting. Furthermore, we present an interactive visualization technique to explore performance spaces of any dimension, which can help system designers to make reliable topology decisions even without detailed technical knowledge of the underlying circuits. Numerical experiments that confirm the efficiency of our approach are performed by computing seven-dimensional performance spaces for an analog low-dropout regulator as used in the radio-frequency identification domain.
16:55 CEST TS29.6 GRADIENT APPROXIMATION OF APPROXIMATE MULTIPLIERS FOR HIGH-ACCURACY DEEP NEURAL NETWORK RETRAINING
Speaker:
Chang Meng, EPFL, CH
Authors:
Chang Meng1, Wayne Burleson2, Weikang Qian3 and Giovanni De Micheli1
1EPFL, CH; 2U Massachusetts Amherst, US; 3Shanghai Jiao Tong University, CN
Abstract
Approximate multipliers (AppMults) are widely employed in deep neural network (DNN) accelerators to reduce the area, delay, and power consumption. However, the inaccuracies of AppMults degrade DNN accuracy, necessitating a retraining process to recover accuracy. A critical step in retraining is computing the gradient of the AppMult, i.e., the partial derivative of the approximate product with respect to each input operand. Conventional methods approximate this gradient using that of the accurate multiplier (AccMult), often leading to suboptimal retraining results, especially for AppMults with relatively large errors. To address this issue, we propose a difference-based gradient approximation of AppMults to improve retraining accuracy. Experimental results show that compared to the state-of-the-art methods, our method improves the DNN accuracy after retraining by 4.10% and 2.93% on average for the VGG and ResNet models, respectively. Moreover, after retraining a ResNet18 model using the 7-bit AppMult, the final DNN accuracy does not degrade compared to the quantized model using the 7-bit AccMult, while the power consumption is reduced by 51%.
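A minimal sketch of a difference-based gradient, assuming a simple truncation-based approximate multiplier as a stand-in (the multipliers, step size, and exact gradient formulation in the paper may differ):

    import numpy as np

    def app_mult(a, b, drop_bits=3):
        scale = 1 << drop_bits
        return (np.floor(a / scale) * scale) * b       # truncate low bits of one operand

    def diff_grad_a(a, b, delta=1.0):
        # probe the approximate product instead of reusing d(ab)/da = b
        return (app_mult(a + delta, b) - app_mult(a, b)) / delta

    a, b = 37.0, 5.0
    print("accurate-multiplier gradient:", b)                    # what conventional retraining uses
    print("difference-based gradient   :", diff_grad_a(a, b))    # reflects the AppMult itself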
17:00 CEST TS29.7 SEGMENT-WISE ACCUMULATION: LOW-ERROR LOGARITHMIC DOMAIN COMPUTING FOR EFFICIENT LARGE LANGUAGE MODEL INFERENCE
Speaker:
Xinkuang Geng, Shanghai Jiao Tong University, CN
Authors:
Xinkuang Geng, Yunjie Lu, Hui Wang and Honglan Jiang, Shanghai Jiao Tong University, CN
Abstract
Logarithmic domain computing (LDC) has great potential for reducing quantization errors and computational complexity in Large Language Models (LLMs). While logarithmic multiplication can be efficiently implemented using fixed-point addition, the primary challenge in multiply-accumulate (MAC) operations is balancing the precision of logarithmic adders with their hardware overhead. Through a detailed analysis of the errors inherent in LDC-based LLMs, we propose segment-wise accumulation (SWA) to mitigate these errors. In addition, a processing element (PE) is introduced to enable SWA in the systolic array architecture. Compared with the accumulation scheme devised for enhancing floating-point computing, the proposed SWA facilitates integration into existing accelerator architectures, resulting in lower hardware overhead. The experimental results show that SWA allows LDC under low-precision configurations to achieve remarkable accuracy in LLMs, demonstrating higher hardware efficiency than merely increasing the precision of individual computations. Our method, while maintaining a lower hardware overhead than traditional LDC, achieves more than 13.9% improvement in average accuracy across multiple zero-shot benchmarks on Llama-2-7B. Furthermore, compared to integer domain computing, a logarithmic processing element array based on the proposed SWA yields reductions of 24.6% in area and 42.3% in power, while achieving higher accuracy.
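The general principle behind segment-wise accumulation can be sketched with a deliberately low-precision accumulator standing in for a cheap logarithmic adder (numpy; float16 and the segment size are illustrative assumptions, not the paper's LNS arithmetic): accumulating in independent segments and combining them at higher precision keeps the rounding error from growing with the reduction length.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=4096).astype(np.float32)
    reference = x.astype(np.float64).sum()

    def naive(vals):
        acc = np.float16(0.0)
        for v in vals:
            acc = np.float16(acc + np.float16(v))      # every addition is low precision
        return float(acc)

    def segment_wise(vals, seg=64):
        partials = [naive(vals[i:i + seg]) for i in range(0, len(vals), seg)]
        return float(np.sum(np.float32(partials)))     # combine segments more precisely

    print(abs(naive(x) - reference), abs(segment_wise(x) - reference))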
17:05 CEST TS29.8 LOOKUP TABLE REFACTORING: TOWARDS EFFICIENT LOGARITHMIC NUMBER SYSTEM ADDITION FOR LARGE LANGUAGE MODELS
Speaker:
Xinkuang Geng, Shanghai Jiao Tong University, CN
Authors:
Xinkuang Geng1, Siting Liu2, Hui Wang1, Jie Han3 and Honglan Jiang1
1Shanghai Jiao Tong University, CN; 2ShanghaiTech University, CN; 3University of Alberta, CA
Abstract
Compared to integer quantization, logarithmic quantization aligns more effectively with the long-tailed distribution of data in large language models (LLMs), resulting in lower quantization errors. Moreover, the logarithmic number system (LNS) employs a fixed-point adder to perform multiplication, indicating a potential reduction in computational complexity for LLM accelerators that require extensive multiply-accumulate (MAC) operations. However, a key bottleneck is that LNS addition requires complex nonlinear functions, which are typically approximated using lookup tables (LUTs). This study aims to reduce the hardware resources needed for LUTs in LNS addition while maintaining high precision. Specifically, we investigate the specific nature of addition operations within LLMs; the relationship between the hardware parameters of the LUT and the computing errors is then mathematically derived. Based on these insights, we propose LUT refactoring to optimize the LUT for enhanced efficiency in LNS addition. With 10.93% and 19.78% reductions in area-delay product (ADP) and power-delay product (PDP), respectively, LUT refactoring results in an accuracy improvement of up to 33.5% in LLM benchmarks compared to the naive design. When compared to integer quantization, our method achieves higher accuracy while reducing area by 18.27% and power by 42.61%.
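To see why a lookup table appears at all, recall the LNS addition identity: for x = log2(a) and y = log2(b), log2(a + b) = max(x, y) + log2(1 + 2^-(|x - y|)). The sketch below tabulates the nonlinear correction term (numpy; table size and indexing are arbitrary choices, not the refactored LUT of the paper):

    import numpy as np

    D_MAX, ENTRIES = 8.0, 64
    lut = np.log2(1.0 + 2.0 ** -np.linspace(0.0, D_MAX, ENTRIES))   # precomputed corrections

    def lns_add(x, y):
        d = abs(x - y)
        if d >= D_MAX:                      # operands far apart: the smaller one vanishes
            return max(x, y)
        idx = int(d / D_MAX * (ENTRIES - 1))
        return max(x, y) + lut[idx]         # table lookup replaces exact log/exp hardware

    a, b = 6.0, 3.5                         # log2 of two positive values
    print(lns_add(a, b), np.log2(2.0 ** a + 2.0 ** b))   # error is set by the LUT resolution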
17:10 CEST TS29.9 EVASION: EFFICIENT KV CACHE COMPRESSION VIA PRODUCT QUANTIZATION
Speaker:
Zongwu Wang, Shanghai Jiao Tong University, CN
Authors:
Zongwu Wang1, Fangxin Liu1, Peng Xu1, Qingxiao Sun2, Junping Zhao3 and Li Jiang1
1Shanghai Jiao Tong University, CN; 2China University of Petroleum, Beijing, CN; 3Ant Group, CN
Abstract
Large language models (LLMs) benefit from longer context lengths but suffer from quadratic complexity in the attention mechanism. KV caching alleviates this issue by storing pre-computed data, but its memory requirements increase linearly with context length, thereby hindering the intelligent development of LLMs. Traditional weight quantization schemes perform poorly in KV quantization for two reasons: (1) KV requires dynamic quantization and de-quantization, which can lead to significant performance degradation; (2) outliers are widely present in KV, which poses a challenge to low-bitwidth uniform quantization. This work proposes a novel approach called EVASION to achieve low-bitwidth quantization through product quantization. We thoroughly analyze the distribution of the KV cache and demonstrate the limitations of existing quantization schemes. Then a non-uniform quantization algorithm based on product quantization is introduced, which offers efficient compression while maintaining accuracy. Finally, we design a high-performance GPU inference framework for EVASION, utilizing sparse computation and asynchronous quantization for further acceleration. Comprehensive evaluation results demonstrate that EVASION achieves 4-bit quantization with trivial perplexity and accuracy loss, and it also achieves a 1.8x end-to-end inference speedup.
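A minimal sketch of product quantization applied to cached key vectors, assuming per-sub-space k-means codebooks and one-byte codes (scikit-learn; the sizes are illustrative, and the paper's outlier handling, sparse computation, and GPU kernels are omitted):

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    keys = rng.normal(size=(1024, 64)).astype(np.float32)    # cached keys: tokens x dim
    M, K = 8, 256                                            # sub-spaces, codewords per sub-space
    sub = keys.reshape(1024, M, 64 // M)

    codebooks, codes = [], []
    for m in range(M):
        km = KMeans(n_clusters=K, n_init=1, random_state=0).fit(sub[:, m, :])
        codebooks.append(km.cluster_centers_)
        codes.append(km.labels_.astype(np.uint8))            # one byte per sub-vector

    codes = np.stack(codes, axis=1)                          # the compressed cache
    recon = np.stack([codebooks[m][codes[:, m]] for m in range(M)], axis=1).reshape(1024, 64)
    print("compression ratio:", keys.nbytes / codes.nbytes)  # 32x versus float32 here
    print("mean squared error:", np.mean((keys - recon) ** 2))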
17:11 CEST TS29.10 SOFTEX: A LOW POWER AND FLEXIBLE SOFTMAX ACCELERATOR WITH FAST APPROXIMATE EXPONENTIATION
Speaker:
Andrea Belano, University of Bologna, IT
Authors:
Andrea Belano1, Yvan Tortorella1, Angelo Garofalo2, Davide Rossi1, Luca Benini3 and Francesco Conti1
1Università di Bologna, IT; 2University of Bologna, ETH Zurich, IT; 3ETH Zurich, CH | Università di Bologna, IT
Abstract
Transformer-based models achieve remarkable results in a wide range of fields, including natural language processing, computer vision, and audio processing. Despite Transformers being computationally dominated by matrix multiplications (MatMul), a non-negligible portion of their runtime is also spent on executing the softmax operator. The softmax is a non-linear and non-pointwise operator that can become a performance bottleneck especially if dedicated hardware is used to decrease the runtime of MatMul operators. We introduce SoftEx, a parametric accelerator for the softmax function of BF16 vectors. SoftEx introduces an approximate exponentiation algorithm balancing efficiency (121× speedup over glibc's implementation) with accuracy (mean relative error of 0.14%). We integrate our design in a heterogeneous tightly-coupled cluster containing 256KiB of shared SRAM and 8 general-purpose RISC-V cores as well as a 24×8 systolic array MatMul accelerator. In 12nm technology, SoftEx occupies 0.033 mm², only 2.75% of the cluster, and achieves an operating frequency of 1.12 GHz. Computing the attention probabilities with SoftEx requires up to 10.8× less time and 26.8× less energy compared to a highly optimized software implementation running on the 8 cores, boosting the overall throughput on MobileBERT's attention layer by up to 2.17×, achieving a performance of 324 GOPS at 0.80V or 1.30 TOPS/W at 0.55V at full BF16 accuracy.
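The flavour of fast approximate exponentiation can be sketched by rewriting e^x as 2^(x*log2(e)), handling the integer part of the exponent exactly and approximating 2^f on [0, 1) with a low-order polynomial (numpy; this mirrors the general idea only and is not SoftEx's algorithm or error profile):

    import numpy as np

    def approx_exp(x):
        t = x * np.log2(np.e)
        n = np.floor(t)                                      # integer part: exact scaling by 2**n
        f = t - n                                            # fractional part in [0, 1)
        p = 1.0 + f * (0.6930 + f * (0.2416 + f * 0.0520))   # cubic fit to 2**f
        return np.ldexp(p, n.astype(np.int32))

    def approx_softmax(v):
        e = approx_exp(v - v.max())                          # max subtraction for stability
        return e / e.sum()

    v = np.array([1.5, -0.3, 2.2, 0.0])
    print(approx_softmax(v))
    print(np.exp(v - v.max()) / np.exp(v - v.max()).sum())   # exact reference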

TS30 System Level Design and Test, Modeling and Verification

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS30.1 ERASER: EFFICIENT RTL FAULT SIMULATION FRAMEWORK WITH TRIMMED EXECUTION REDUNDANCY
Speaker:
Jiaping Tang, State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences, CN
Authors:
Jiaping Tang1, Jianan Mu1, Silin Liu1, Zizhen Liu1, Feng Gu2, Xinyu Zhang1, Leyan Wang1, Shengwen Liang2, Jing Ye1, Huawei Li1 and Xiaowei Li3
1State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences / University of Chinese Academy of Sciences / CASTEST, CN; 2State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences / University of Chinese Academy of Sciences, CN; 3State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences / University of Chinese Academy of Sciences, CN
Abstract
As intelligent computing devices increasingly integrate into human life, ensuring the functional safety of the corresponding electronic chips becomes more critical. A key metric for functional safety is achieving sufficient fault coverage. To meet this requirement, extensive time-consuming fault simulation of the RTL code is necessary during the chip design phase. The main overhead in RTL fault simulation comes from simulating behavioral nodes (always blocks). Due to the limited fault propagation capacity, fault simulation results often match the good simulation results for many behavioral nodes. A key strategy for accelerating RTL fault simulation is the identification and elimination of redundant simulations. Existing methods detect redundant executions by examining whether the fault inputs to each RTL node are consistent with the good inputs. However, we observe that this input comparison mechanism overlooks a significant amount of implicit redundant execution: although the fault inputs differ from the good inputs, the node's execution results remain unchanged. Our experiments reveal that this overlooked redundant execution constitutes nearly half of the total execution overhead of behavioral nodes, becoming a significant bottleneck in current RTL fault simulation. The underlying reason for this overlooked redundancy is that, in these cases, the true execution paths within the behavioral nodes are not affected by the changes in input values. In this work, we propose a behavior-level redundancy detection algorithm that focuses on the true execution paths. Building on the elimination of redundant executions, we further develop an efficient RTL fault simulation framework, Eraser. Experimental results show that compared to commercial tools, under the same fault coverage, our framework achieves a 3.9× improvement in simulation performance on average.
16:35 CEST TS30.2 PESEC -- A SIMPLE POWER-EFFICIENT SINGLE ERROR CORRECTING CODING SCHEME FOR RRAM
Speaker:
Shlomo Engelberg, Jerusalem College of Technology, IL
Authors:
Shlomo Engelberg1 and Osnat Keren2
1Jerusalem College of Technology, IL; 2Bar-Ilan University, IL
Abstract
The power consumed when writing to Resistive Random Access Memory (RRAM) is significantly greater than that consumed by many charge-based memories such as SRAM, DRAM and NAND-Flash memories. As a result, when used in applications where instantaneous power consumption is constrained, the number of bits that can be set or reset must not exceed a certain threshold. In this paper, we present a power-efficient, single error correcting (PESEC) code for memory macros, which, when combined with bus encoding, ensures low-power operation and reliable data storage. This systematic, multiple-representation based single-error correcting code provides a relatively high rate, with a marginal increase in implementation cost relative to that of a standard Hamming code, and it can be used with any bus encoder.
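For context, the standard Hamming single-error-correcting baseline that the paper's rate and cost are compared against can be written compactly; the multiple-representation PESEC construction itself is not reproduced here (Python, systematic Hamming(7,4), illustrative).

    import numpy as np

    G = np.array([[1, 0, 0, 0, 1, 1, 0],      # generator: 4 data bits -> 7-bit codeword
                  [0, 1, 0, 0, 1, 0, 1],
                  [0, 0, 1, 0, 0, 1, 1],
                  [0, 0, 0, 1, 1, 1, 1]])
    H = np.array([[1, 1, 0, 1, 1, 0, 0],      # parity-check matrix
                  [1, 0, 1, 1, 0, 1, 0],
                  [0, 1, 1, 1, 0, 0, 1]])

    def encode(data):
        return data @ G % 2

    def correct(word):
        syndrome = word @ H.T % 2
        if syndrome.any():                    # non-zero syndrome: locate the flipped bit
            err = next(i for i in range(7) if np.array_equal(H[:, i], syndrome))
            word = word.copy()
            word[err] ^= 1
        return word

    cw = encode(np.array([1, 0, 1, 1]))
    noisy = cw.copy(); noisy[2] ^= 1          # inject a single-bit error
    print(np.array_equal(correct(noisy), cw)) # True: error corrected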
16:40 CEST TS30.3 FROM GATES TO SDCS: UNDERSTANDING FAULT PROPAGATION THROUGH THE COMPUTE STACK
Speaker:
Odysseas Chatzopoulos, University of Athens, GR
Authors:
Odysseas Chatzopoulos1, George Papadimitriou1, Dimitris Gizopoulos1, Harish Dixit2 and Sriram Sankar2
1University of Athens, GR; 2Meta Platforms Inc., US
Abstract
Silent Data Corruption (SDC) is the most severe effect of a silicon defect in a CPU or other computing chip. The arithmetic units of a CPU are, usually, unprotected and are, thus, the ones that most likely produce SDCs (as well as visible malfunctions of programs such as crashes). In this work, we shed light on the traversal of silicon defects from their point of origin deep inside arithmetic units of complex CPUs towards the program result. We employ microarchitecture-level fault injection enhanced with gate-level designs of the arithmetic units of interest. The hybrid setup combines (i) the accuracy of the hardware and fault modeling and (ii) the speed of program simulation to run long programs to end (thus observing SDC incidents); the analysis that this combination delivers is impossible at other abstraction layers which are either hardware-agnostic (software level) or extremely slow (gate-level). We quantify the effects of faults in two stages and with multiple metrics: (a) how faults propagate to the outputs of the arithmetic units when individual instructions are executed, and (b) how faults eventually affect the outcome of the program generating SDCs, crashes, or being masked. Our fine-grain findings can be utilized for informed fault detection and tolerance strategies at the hardware or the software levels.
16:45 CEST TS30.4 RAPID FAULT INJECTION SIMULATION BY HASH-BASED DIFFERENTIAL FAULT EFFECT EQUIVALENCE CHECKS
Speaker:
Johannes Geier, TU Munich, DE
Authors:
Johannes Geier1, Leonidas Kontopoulos1, Daniel Mueller-Gritschneder2 and Ulf Schlichtmann1
1TU Munich, DE; 2TU Wien, AT
Abstract
Assessing a computational system's resilience to hardware faults is essential for safety and security-related systems. Fault Injection (FI) simulation is a valuable tool that can increase confidence in computational systems and guide hardware and software design decisions in the early stages of development. However, simulating hardware at low levels of abstraction, such as Register Transfer Level (RTL), is costly, and minimizing the effort required for large-scale FI campaigns is a significant objective. This work introduces Hash-based Differential Fault Effect Equivalence Checks to automatically terminate experiments early based on predicting their outcome. We achieve this by matching observed fault effects to ones already encountered in previous experiments. We generate these hashes from differentials computed by repurposing existing fast boot checkpoints from a state-of-the-art acceleration method. By integrating these approaches in an automated manner, we can accelerate a large-scale FI simulation of a CPU at RTL. We reduce the average simulation time by a factor of up to 25, compared to a factor of around 2 to 5 for state-of-the-art techniques. While maintaining 100% accuracy, we can recover the faulty state through the stored differentials.
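The early-termination idea can be sketched as follows, assuming checkpoints are available as dictionaries of register values; the hashing of state differentials below is illustrative and does not reflect the paper's checkpoint format or workflow.

    import hashlib

    def diff_digest(golden, faulty):
        diff = sorted((k, v) for k, v in faulty.items() if golden.get(k) != v)
        return hashlib.sha256(repr(diff).encode()).hexdigest()

    golden = {"pc": 0x100, "r1": 7, "r2": 3}
    seen = {}                                        # digest -> recorded outcome

    for fault_id, faulty in [(0, {"pc": 0x100, "r1": 7, "r2": 11}),
                             (1, {"pc": 0x100, "r1": 7, "r2": 11}),   # same fault effect as 0
                             (2, {"pc": 0x104, "r1": 7, "r2": 3})]:
        h = diff_digest(golden, faulty)
        if h in seen:
            print(f"fault {fault_id}: terminate early, outcome = {seen[h]}")
        else:
            outcome = "SDC"                          # placeholder for running the full simulation
            seen[h] = outcome
            print(f"fault {fault_id}: simulated to completion, outcome = {outcome}")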
16:50 CEST TS30.5 DEAR: DEPENDABLE 3D ARCHITECTURE FOR ROBUST DNN TRAINING
Speaker:
Ashish Reddy Bommana, Arizona State University, US
Authors:
Ashish Reddy Bommana1, Farshad Firouzi2, Chukwufumnanya Ogbogu3, Biresh Kumar Joardar4, Janardhan Rao Doppa3, Partha Pratim Pande3 and Krishnendu Chakrabarty1
1Arizona State University, US; 2ASU, US; 3Washington State University, US; 4University of Houston, US
Abstract
ReRAM-based compute-in-memory (CiM) architectures present an attractive design choice for accelerating deep neural network (DNN) training. However, these architectures are susceptible to stuck-at faults (SAFs) in ReRAM cells, which arise from manufacturing defects and cell wearout over time, particularly due to the continuous weight updates during DNN training. These faults significantly degrade accuracy and compromise dependability. To address this issue, we propose DEAR: a dependable 3D architecture for robust DNN training. DEAR introduces a novel online compensation method that employs a digital compensation unit to correct SAF-induced errors dynamically during both the forward and backward phases of DNN training. Additionally, DEAR leverages an HBM-based 3D memory structure to store fault-related error information efficiently. Experimental results show that DEAR limits inferencing accuracy loss to under 2% even when up to 10% of cells are faulty with uniformly distributed faults, and under 2% for up to 5% faulty cells in clustered distributions. This high fault tolerance is achieved with an area overhead of 11.5% and an energy overhead of less than 6% for VGG networks and less than 12% for ResNet networks.
16:55 CEST TS30.6 IMPROVING SOFTWARE RELIABILITY WITH RUST: IMPLEMENTATION FOR ENHANCED CONTROL FLOW CHECKING METHODS
Speaker:
Jacopo Sini, Politecnico di Torino, IT
Authors:
Jacopo Sini1, Mohammadreza Amel Solouki1, Massimo Violante1 and Giorgio Di Natale2
1Politecnico di Torino, IT; 2TIMA - CNRS, FR
Abstract
The C language, traditionally used in developing safety-critical systems, often faces memory management issues, leading to potential vulnerabilities. Rust emerges as a safer and more secure alternative, aiming to mitigate these risks with its robust memory protection features, making it suitable for producing reliable code in critical environments such as the automotive industry. This study proposes employing Rust code hardened by Control Flow Checking (CFC) in real-time embedded systems, whose software is traditionally developed in assembly and C. The methods have been implemented at the application level, i.e., in the Rust source code, to make them platform-agnostic. A methodology is presented for leveraging Rust's advantages, such as stronger security guarantees and modern language features, to implement these methods more effectively. Highlighting a use case in the automotive sector, our research demonstrates Rust's capacity to enhance system reliability through CFC, especially against Random Hardware Faults. Two CFC algorithms from the literature, YACCA and RACFED, have been implemented in Rust to assess their effectiveness, obtaining 46.5% Diagnostic Coverage for the YACCA method and 50.1% for RACFED. The proposed approach is aligned with functional safety standards, showcasing how Rust can balance safety requirements and cost considerations in industries reliant on software solutions for critical functionalities.
17:00 CEST TS30.7 BRIDGING THE GAP BETWEEN ANOMALY DETECTION AND RUNTIME VERIFICATION: H-CLASSIFIERS
Speaker:
Hagen Heermann, RPTU Kaiserslautern, DE
Authors:
Hagen Heermann and Christoph Grimm, University of Kaiserslautern-Landau, DE
Abstract
Runtime Verification (RV) and Anomaly Detection (AD) are crucial for ensuring the reliability of cyber-physical systems, but existing methods often suffer from high computational costs and lack of explainability. This paper presents a novel approach that integrates formal methods into anomaly detection, transforming complex system models into efficient classification tasks. By combining the strengths of RV and AD, our method significantly improves detection efficiency while providing explainability for failure causes. Our approach offers a promising solution for enhancing the safety and reliability of critical systems.
17:05 CEST TS30.8 CRITICALITY AND REQUIREMENT AWARE HETEROGENEOUS COHERENCE FOR MIXED CRITICALITY SYSTEMS
Speaker:
Mohamed Hassan, McMaster University, CA
Authors:
Safin Bayes and Mohamed Hassan, McMaster University, CA
Abstract
We propose CoHoRT, the first heterogeneous cache-coherent solution for mixed criticality systems (MCS), equipped with several features that target the characteristics and requirements of such systems. CoHoRT is requirement-aware: it provides an optimization engine to optimally configure the architecture based on system requirements. CoHoRT is also criticality-aware: it introduces a low-cost novel architecture that enables cores to heterogeneously run different coherence protocols (time-based and MSI-based protocols). Moreover, it enables a run-time switch between these protocols to provide hardware support for operation-mode switching, which is a common challenge in MCS. Our evaluation shows that CoHoRT outperforms existing solutions in both worst-case memory latency and overall average performance. It also illustrates that CoHoRT is able to meet timing requirements in various MCS setups and showcases CoHoRT's ability to adapt to mode switches.
17:10 CEST TS30.9 PROTECTING CYBER-PHYSICAL SYSTEMS VIA VENDOR-CONSTRAINED SECURITY AUDITING WITH REINFORCEMENT LEARNING
Speaker:
Nan Wang, East China University of Science and Technology, CN
Authors:
Nan Wang1, Kai Li1, Lijun Lu1, Zhiwei Zhao1 and Zhiyuan Ma2
1School of Information Science and Engineering, East China University of Science and Technology, CN; 2Institute of Machine Intelligence, University of Shanghai for Science and Technology, CN
Abstract
Hardware Trojans may cause security issues in cyber-physical systems (CPSs), and recently proposed mutual auditing frameworks have helped build trustworthy CPSs with untrustworthy devices by requiring neighboring devices to come from different vendors. However, this can cause severe multi-vendor integration challenges, such as high cost, difficult maintenance, and an insufficient number of vendors from which to purchase devices. In this work, we improve the mutual auditing framework by maintaining the security of the CPS with fewer vendors. First, a vendor-constrained security auditing framework is introduced to enhance the security of the CPS network with limited vendors, where side-auditing detects hardware Trojan collusion between neighboring nodes and infected-node isolation stops the spread of active HTs. Second, a multi-agent cooperative reinforcement learning-based method is proposed to assign devices to appropriate vendors in the context of security auditing, and it provides solutions that minimize the number of offline nodes due to HT infection. The experimental results show that our proposed method reduces the number of vendors needed by 40.95%, while causing only a 0.39% increase in infected nodes.
17:15 CEST TS30.10 ADAPTIVE BRANCH-AND-BOUND TREE EXPLORATION FOR NEURAL NETWORK VERIFICATION
Speaker:
Kota Fukuda, Kyushu University, JP
Authors:
Kota Fukuda1, Guanqin Zhang2, Zhenya Zhang1, Yulei Sui2 and Jianjun Zhao1
1Kyushu University, JP; 2University of New South Wales, AU
Abstract
Formal verification is a rigorous approach that can provably ensure the quality of neural networks, and to date, Branch and Bound (BaB) is the state of the art, performing verification by splitting the problem as needed and applying off-the-shelf verifiers to sub-problems for improved performance. However, existing BaB may not be efficient, due to its naive way of exploring the space of sub-problems, which ignores the importance of different sub-problems. To bridge this gap, we first introduce a notion of importance that reflects how likely a counterexample can be found within a sub-problem, and then we devise a novel verification approach, called ABONN, that explores the sub-problem space of BaB adaptively, in a Monte-Carlo tree search (MCTS) style. The exploration is guided by the importance of the different sub-problems, so it favors the sub-problems that are more likely to contain counterexamples. As soon as it finds a counterexample, it can immediately terminate; even if it finds none, it can still verify the problem after visiting all the sub-problems. We evaluate ABONN with 552 verification problems from commonly used datasets and neural network models, and compare it with state-of-the-art verifiers as baseline approaches. The experimental evaluation shows that ABONN demonstrates speedups of up to 15.2x on MNIST and 24.7x on CIFAR-10. We further study the influence of hyperparameters on the performance of ABONN, and the effectiveness of our adaptive tree exploration.
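A toy sketch of importance-guided sub-problem selection in the MCTS/UCB spirit, where each sub-problem's score stands in for its estimated likelihood of containing a counterexample (the scores here are synthetic; ABONN's actual importance measure and tree policy are defined in the paper):

    import math, random

    random.seed(0)
    subproblems = [{"id": i, "visits": 0, "score": random.random()} for i in range(6)]
    total_visits = 0

    def ucb(sp, c=1.4):
        if sp["visits"] == 0:
            return float("inf")                      # visit every sub-problem at least once
        return sp["score"] + c * math.sqrt(math.log(total_visits) / sp["visits"])

    for step in range(10):
        pick = max(subproblems, key=ucb)             # adaptive choice instead of FIFO/DFS order
        pick["visits"] += 1
        total_visits += 1
        print(f"step {step}: explore sub-problem {pick['id']}")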
17:20 CEST TS30.11 TOWARDS COHERENT SEMANTICS: A QUANTITATIVELY TYPED EDSL FOR SYNCHRONOUS SYSTEM DESIGN
Speaker:
Rui Chen, KTH Royal Institute of Technology, SE
Authors:
Rui Chen and Ingo Sander, KTH Royal Institute of Technology, SE
Abstract
We present SynQ, an embedded DSL (EDSL) targeting synchronous system design with quantitative types. SynQ is designed to facilitate semantically coherent system design processes by language embedding and advanced type systems. The current case study indicates the potential for a seamless system design process.
17:21 CEST TS30.12 CO-DESIGN OF SUSTAINABLE EMBEDDED SYSTEMS-ON-CHIP
Speaker:
Dominik Walter, FAU, DE
Authors:
Jan Spieck, Dominik Walter, Jan Waschkeit and Jürgen Teich, Friedrich-Alexander-Universität Erlangen-Nürnberg, DE
Abstract
This paper introduces a novel approach to the co-design of sustainable embedded systems through multi-objective design space exploration (DSE). We propose a two-phase methodology that optimizes both the multiprocessor system-on-chip (MPSoC) architecture and application mappings, considering sustainability, reliability, performance, and cost as optimization objectives. Our method thereby accounts for both operational and embodied emissions, providing a more comprehensive assessment of sustainability. First, an individual intra-application DSE is performed to explore Pareto-optimal constraint graphs for each application. The second phase, an inter-application DSE, combines these results to explore sustainable target architectures and corresponding application mappings. Our approach incorporates detailed models for embodied emissions (scope 1 and scope 2), operational emissions, reliability, performance, and cost. The evaluation demonstrates that our sustainability-aware DSE is able to explore design spaces, supported by superior results in four key objectives. This enables the development of sustainable embedded systems whilst achieving high performance and reliability.

TS31 Emerging Design Technologies

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST

Time Label Presentation Title
Authors
16:30 CEST TS31.1 GENETIC ALGORITHM-DRIVEN IMC MAPPING FOR CNNS USING MIXED QUANTIZATION AND MLC FEFETS
Speaker:
Alptekin Vardar, Fraunhofer IPMS, DE
Authors:
Alptekin Vardar, Franz Müller, Gonzalo Cuñarro Podestá, Nellie Laleni, Nandakishor Yadav and Thomas Kämpfe, Fraunhofer IPMS, DE
Abstract
Ferroelectric Field-Effect Transistors (FeFETs) are emerging as a highly promising non-volatile memory (NVM) technology for in-memory computing architectures, thanks to their low power consumption and non-volatility. These characteristics make FeFETs particularly well-suited for convolutional neural networks (CNNs), especially in power-constrained environments where minimizing the memory footprint is critical for improving both area efficiency and energy consumption. Two effective strategies for reducing memory requirements are quantization and the use of multi-level cell (MLC) configurations in NVMs. This work proposes a solution that combines mixed quantization schemes with FeFET-based MLC and single-level cell (SLC) configurations to balance memory usage and accuracy. Given the large hyperparameter space introduced by these combinations, we employ a genetic algorithm to efficiently explore and identify Pareto-optimal solutions, allowing flexible adaptation to various application-specific requirements. Our approach achieves significant improvements in both memory efficiency and performance, reducing memory usage by 50% while sacrificing only 3% accuracy compared to the 8-bit ResNet baseline. After a single epoch of retraining, the accuracy matches the baseline while fully retaining the memory savings. Additionally, when compared to the 4-bit baseline, a 46% memory reduction is achieved with virtually no loss in accuracy.
16:35 CEST TS31.2 OPENMFDA: MICROFLUIDIC DESIGN AUTOMATION IN THREE DIMENSIONS
Speaker:
Ashton Snelgrove, University of Utah, US
Authors:
Ashton Snelgrove1, Daniel Wakeham1, Skylar Stockham1, Scott Temple2 and Pierre-Emmanuel Gaillardon1
1University of Utah, US; 2Primis AI, US
Abstract
Current microfluidic design automation (MFDA) solutions are limited by the planarity requirements of current manufacturing techniques. Recent advances in stereolithography 3D printing create an opportunity for new MFDA design methodologies. We propose a methodology for the placement of microfluidic components and the routing of flow and control channels in three dimensions. Additionally, we propose a methodology for generating a printable 3D structure from the layout. We then present OpenMFDA, an open-source MFDA design flow implementing the proposed methodologies. This design flow takes a structural netlist and produces a sliced design for manufacturing using an SLA 3D printer. Our methodology demonstrates short run times and generates devices with 2-20× smaller area compared to state-of-the-art MFDA tools.
16:40 CEST TS31.3 CLAIRE: COMPOSABLE CHIPLET LIBRARIES FOR AI INFERENCE
Speaker:
Pragnya Nalla, University of Minnesota Twin Cities, US
Authors:
Pragnya Nalla1, Emad Haque2, Yaotian Liu2, Sachin S. Sapatnekar1, Jeff Zhang2, Chaitali Chakrabarti2 and Yu Cao1
1University of Minnesota, US; 2Arizona State University, US
Abstract
Artificial intelligence has made a significant impact on fields like computer vision, natural language processing (NLP), healthcare, and robotics. However, recent AI models, such as GPT-4 and LLaMAv3, demand a significant amount of computational resources, pushing monolithic chips to their technological and practical limits. 2.5D chiplet-based heterogeneous architectures have been proposed to address these limits. While chiplet optimization for models like Convolutional Neural Networks (CNNs) is well established, scaling this approach to accommodate diverse AI inference models with different computing primitives, data volumes, and chiplet sizes is very challenging. A set of hardened IPs and chiplet libraries optimized for a broad range of AI applications is proposed in this work. We derive the set of chiplet configurations that are composable, scalable, and reusable by employing an analytical framework trained on a diverse set of AI algorithms. Testing this set of library-synthesized configurations on a different set of algorithms, we achieve a 1.99×-3.99× improvement in non-recurring engineering (NRE) chiplet design costs, with minimal performance overhead compared to custom chiplet-based ASIC designs. Similar to soft IPs for SoC development, the library of chiplets improves flexibility, reusability, and efficiency for AI hardware designs.
16:45 CEST TS31.4 A TALE OF TWO SIDES OF WAFER: PHYSICAL IMPLEMENTATION AND BLOCK-LEVEL PPA ON FLIP FET WITH DUAL-SIDED SIGNALS
Speaker:
Haoran Lu, Peking University, CN
Authors:
Haoran Lu, Xun Jiang, Yanbang Chu, Ziqiao Xu, Rui Guo, Wanyue Peng, Yibo Lin, Runsheng Wang, Heng Wu and Ru Huang, Peking University, CN
Abstract
As the conventional scaling of logic devices comes to an end, a functional wafer backside and 3D transistor stacking are consensus directions for next-generation logic technology, offering considerable design-space extension for power, signals, or even devices on the wafer backside. The Flip FET (FFET), a novel transistor architecture combining 3D transistor stacking and a fully functional wafer backside, was recently proposed. With a symmetric dual-sided standard cell design, the FFET can deliver around 12.5% cell area scaling and faster but more energy-efficient libraries beyond other stacked transistor technologies such as the Complementary FET (CFET). In addition, thanks to the novel cell design with dual-sided pins, the FFET supports dual-sided signal routing, delivering better routability and a larger backside design space. In this work, we demonstrate a comprehensive FFET evaluation framework considering physical implementation and block-level power-performance-area (PPA) assessment for the first time, in which the key functions are dual-sided routing and dual-sided RC extraction. A 32-bit RISC-V core was used for the evaluation. Compared to the CFET with single-sided signals, the FFET with single-sided signals (for fair comparison) achieved 23.3% post-P&R core area reduction, 25.0% higher frequency and 11.9% lower power at the same utilization, and 16.0% higher frequency at the same core area. Meanwhile, the FFET supports dual-sided signals, which can further benefit from the flexible allocation of cell input pins on both sides. By optimizing the input pin density and the number of BEOL routing layers on each side, a 10.6% frequency gain was realized without power degradation compared to the design with single-sided signal routing. Moreover, the routability and power efficiency of the FFET barely degrade even with the number of routing layers reduced from 12 to 5 on each side, validating the great space for cost-friendly design enabled by the FFET.
16:50 CEST TS31.5 COLUMN-WISE QUANTIZATION OF WEIGHTS AND PARTIAL SUMS FOR ACCURATE AND EFFICIENT COMPUTE-IN-MEMORY ACCELERATORS
Speaker:
Kang Eun Jeon, Sungkyunkwan University, KR
Authors:
Jiyoon Kim, Kang Eun Jeon, Yulhwa Kim and Jong Hwan Ko, Sungkyunkwan University, KR
Abstract
Compute-in-memory (CIM) is an efficient method for implementing deep neural networks (DNNs) but suffers from substantial overhead from analog-to-digital converters (ADCs), especially as ADC precision increases. Low-precision ADCs can reduce this overhead but introduce partial-sum quantization errors degrading accuracy. Additionally, low-bit weight constraints, imposed by cell limitations and the need for multiple cells for higher-bit weights, present further challenges. While fine-grained partial-sum quantization has been studied to lower ADC resolution effectively, weight granularity, which limits overall partial-sum quantized accuracy, remains underexplored. This work addresses these challenges by aligning weight and partial-sum quantization granularities at the column-wise level. Our method improves accuracy while maintaining dequantization overhead, simplifies training by removing two-stage processes, and ensures robustness to memory cell variations via independent column-wise scale factors. We also propose an open-source CIM-oriented convolution framework to handle fine-grained weights and partial-sums efficiently, incorporating a novel tiling method and group convolution. Experimental results on ResNet-20 (CIFAR-10, CIFAR-100) and ResNet-18 (ImageNet) show accuracy improvements of 0.99%, 2.69%, and 1.01%, respectively, compared to the best-performing related works. Additionally, variation analysis reveals the robustness of our method against memory cell variations. These findings highlight the effectiveness of our quantization scheme in enhancing accuracy and robustness while maintaining hardware efficiency in CIM-based DNN implementations. Our code is available at https://github.com/jiyoonkm/ColumnQuant.
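A minimal sketch of column-wise weight quantization, assuming a symmetric scheme with one scale factor per crossbar column (numpy; the bit-width and the synthetic weight matrix are illustrative):

    import numpy as np

    def quantize_columnwise(W, bits=4):
        qmax = 2 ** (bits - 1) - 1
        scale = np.abs(W).max(axis=0) / qmax                    # one scale factor per column
        q = np.clip(np.round(W / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    rng = np.random.default_rng(0)
    W = rng.normal(scale=[0.05, 0.5, 2.0], size=(64, 3))        # columns with very different ranges
    q, scale = quantize_columnwise(W)
    print("per-column scales:", scale)
    print("max abs error    :", np.abs(W - q * scale).max(axis=0))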
16:55 CEST TS31.6 DHD: DOUBLE HARD DECISION DECODING SCHEME FOR NAND FLASH MEMORY
Speaker:
Lanlan Cui, Xi'an University of Technology, CN
Authors:
Lanlan Cui1, Yichuan Wang1, Renzhi Xiao2, Miao Li3, Xiaoxue Liu1 and Xinhong Hei1
1Xi'an University of Technology, CN; 2Jiangxi University of Science and Technology, CN; 3National University of Defense Technology, CN
Abstract
With the advancement of NAND flash technology, increased storage density leads to intensified interference, which in turn raises the error rate during data retrieval. To ensure data reliability, low-density parity-check (LDPC) codes are extensively employed for error correction in NAND flash memory. Although LDPC soft-decision decoding offers high error correction capability, it comes with significant latency. Conversely, hard-decision decoding, although faster, lacks sufficient error correction strength. Consequently, flash memory typically starts with hard-decision decoding and resorts to multiple rounds of soft-decision decoding upon failure. To minimize decoding latency, this paper proposes a decoding mechanism based on a double hard decision, called DHD. The DHD scheme improves the Log-Likelihood Ratio (LLR) used in the hard-decision process. After the first hard decision fails, the read reference voltage (RRV) is adjusted to perform a second hard-decision decoding. If the second hard decision also fails, soft-decision decoding is then employed. Experimental results demonstrate that when the Raw Bit Error Rate (RBER) is 8.5E-3, DHD reduces the Frame Error Rate (FER) by 86.4% compared to the traditional method.
17:00 CEST TS31.7 WRITE-OPTIMIZED PERSISTENT HASH INDEX FOR NON-VOLATILE MEMORY
Speaker:
Renzhi Xiao, Jiangxi University of Science and Technology, CN
Authors:
Renzhi Xiao1, Dan Feng2, Yuchong Hu2, Yucheng Zhang2, Lanlan Cui3 and Lin Wang2
1Jiangxi University of Science and Technology, CN; 2Huazhong University of Science and Technology, CN; 3Xi'an University of Technology, CN
Abstract
A hashing index provides rapid search performance by swiftly locating key-value items. Non-volatile memory (NVM) technologies have driven research into hashing indexes for NVM, combining hard-disk persistence with DRAM-level performance. Nevertheless, current NVM-based hashing indexes must tackle data inconsistency challenges caused by NVM write reordering or partial writes, and mitigate rapid local wear due to frequent updates, considering NVM's limited endurance. The temporary allocation of buckets in NVM-based chained hashing to resolve hash collisions prolongs the critical path for writing, thus hampering write performance. This paper presents WOPHI, a write-optimized persistent hash index scheme for NVM. By utilizing log-free failure-atomic writes, WOPHI minimizes data consistency overhead and addresses hash conflicts with bucket pre-allocation. Experimental results underscore WOPHI's significant performance enhancements, with insertion latency reduced by up to 88.2% and deletion latency reduced by up to 82.6% compared to existing state-of-the-art schemes. Moreover, WOPHI substantially mitigates data consistency overhead, reducing cache-line flushes by 59.3%, while maintaining robust write throughput for insert and delete operations.
17:05 CEST TS31.8 DEAR-PIM: PROCESSING-IN-MEMORY ARCHITECTURE WITH DISAGGREGATED EXECUTION OF ALL-BANK REQUESTS
Speaker:
Jungi Hyun, Seoul National University, KR
Authors:
Jungi Hyun, Minseok Seo, Seongho Jeong, Hyuk-Jae Lee and Xuan Truong Nguyen, Seoul National University, KR
Abstract
Emerging transformer-based large language models (LLMs) involve many low-arithmetic-intensity operations, which result in sub-optimal performance on general-purpose CPUs and GPUs. Processing-in-Memory (PIM) has shown promise in enhancing performance by reducing data movement bottlenecks. Commodity near-bank PIMs enable in-memory computation through bank-level compute units and typically rely on all-bank commands, which simultaneously operate the compute units of all banks to maximize internal bandwidth and parallelism. However, activating all banks simultaneously before issuing all-bank commands generally requires high peak power, which may exceed the system power limit when stacking multiple PIM devices for LLM inference. Additionally, under a DRAM power constraint, all-bank commands are only issued after all banks are fully activated through a sequence of single-bank activations, incurring bubble cycles and degrading overall performance. To address these shortcomings, this study proposes DEAR-PIM, a novel PIM architecture with Disaggregated Execution of All-bank Requests. DEAR-PIM incorporates a disaggregated command queue, allowing it to buffer all-bank commands and provide them to each bank sequentially without waiting for all-bank activations to complete. However, since all banks must finish their disaggregated execution before simultaneous post-processing, synchronization between early-activated and last-activated banks is necessary. To tackle this issue, DEAR-PIM introduces a column-aware synchronization command scheme that inserts no-op-like commands into unused columns without modifying the memory controller. Experiments demonstrate that DEAR-PIM achieves a speedup of 2.03-3.33× over an A100 GPU and improves performance by 1.11-1.52× compared to the sequential activation scheme. DEAR-PIM also reduces peak power consumption by 21.3-41.7% compared to the simultaneous activation scheme.
17:10 CEST TS31.9 SYNDCIM: A PERFORMANCE-AWARE DIGITAL COMPUTING-IN-MEMORY COMPILER WITH MULTI-SPEC-ORIENTED SUBCIRCUIT SYNTHESIS
Speaker:
Kunming Shao, The Hong Kong University of Science and Technology, HK
Authors:
Kunming Shao1, Fengshi Tian1, Xiaomeng Wang1, Jiakun Zheng1, Jia Chen2, Jingyu He1, Hui Wu3, Jinbo Chen3, Xihao Guan1, Yi Deng2, Fengbin Tu1, Jie Yang3, Mohamad Sawan3, Tim Cheng1 and Chi Ying Tsui1
1The Hong Kong University of Science and Technology, HK; 2AI Chip Center for Emerging Smart Systems (ACCESS),Hong Kong University of Science and Technology, HK; 3Westlake University, CN
Abstract
Digital Computing-in-Memory (DCIM) is an innovative technology that integrates multiply-accumulation (MAC) logic directly into memory arrays to enhance the performance of modern AI computing. However, the need for customized memory cells and logic components currently necessitates significant manual effort in DCIM design. Existing tools for facilitating DCIM macro designs struggle to optimize subcircuit synthesis to meet user-defined performance criteria, thereby limiting the potential system-level acceleration that DCIM can offer. To address these challenges and enable the agile design of DCIM macros with optimal architectures, we present SynDCIM — a performance-aware DCIM compiler that employs multi-spec-oriented subcircuit synthesis. SynDCIM features an automated performance-to-layout generation process that aligns with user-defined performance expectations. This is supported by a scalable subcircuit library and a multi-spec-oriented searching algorithm for effective subcircuit synthesis. The effectiveness of SynDCIM is demonstrated through extensive experiments and validated with a test chip fabricated in a 40nm CMOS process. Testing results reveal that designs generated by SynDCIM exhibit competitive performance when compared to state-of-the-art manually designed DCIM macros.

US02 Unplugged session

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 16:30 CEST - 18:00 CEST


CC Closing Ceremony

Add this session to my calendar

Date: Wednesday, 02 April 2025
Time: 18:00 CEST - 18:30 CEST