Background
Deep neural networks (DNNs) now exhibit artificial intelligence (AI) capabilities that rival, and on some tasks surpass, the cognitive abilities of the human brain, delivering remarkable performance on a diverse array of intricate and intellectually demanding tasks across many domains. This achievement highlights the immense potential of machine learning and cognitive computing, and it is driving significant transformations across society. However, as DNNs continue to demonstrate remarkable intelligence, we are compelled to confront a fundamental theoretical question: is there an unknown boundary between the capabilities of DNNs and those of the human brain? The question involves multiple dimensions, including technology, philosophy, and ethics.
Highlights
In this study, we propose a novel framework for analyzing the working capability of DNNs, called the cognitive response model with finely adjustable visual illusion scenes (CRVIS). The framework centers on the unknown boundary of DNNs’ working capability, under the hypothesis that DNN performance may differ from human cognition in certain visual illusion scenes, and it conducts a series of cognitive response studies to thoroughly analyze DNN performance across such scenes. In particular, we develop a generation method tailored specifically to visual illusion scenes. By leveraging advanced techniques and incorporating key elements from classic visual illusions, such as Kanizsa figures, Ehrenstein illusions, abutting line gratings, and Ishihara color-blindness plates, we establish adjustable strategies for generating sample images. The model automates the generation of scene images equipped with precise semantic labels, including MNIST-Abutting grating, Kanizsa Polygon-Abutting grating, ColorMNIST-Abutting grating, and COCO-Abutting grating scene images.
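As a concrete example of one classic construction the framework builds on, the following minimal sketch renders a Kanizsa-style square: four inducer discs with their interior quadrants erased imply a square that has no drawn edges. The function name and parameters (`kanizsa_square`, `side`, `radius`) are illustrative only; the CRVIS pipeline itself renders its scenes within Unity.

```python
import numpy as np

def kanizsa_square(size=256, side=120, radius=28):
    """Render a minimal Kanizsa square on a white canvas: four black
    'pac-man' inducers whose missing quadrants suggest an illusory
    square with no physically drawn contour."""
    img = np.full((size, size), 255, dtype=np.uint8)   # white canvas
    yy, xx = np.mgrid[0:size, 0:size]
    c = size // 2
    x0, x1 = c - side // 2, c + side // 2              # illusory square bounds
    y0, y1 = x0, x1                                    # square, so same bounds
    for cy in (y0, y1):                                # discs at the 4 corners
        for cx in (x0, x1):
            disc = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
            img[disc] = 0
    inside = (xx >= x0) & (xx <= x1) & (yy >= y0) & (yy <= y1)
    img[inside] = 255                  # erase the quadrants inside the square
    return img
```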
Methodology
The MNIST-Abutting grating image generation network comprises three layers and can accurately generate high-quality visual scenes with adjustable properties. In the first layer, standard MNIST digits are modeled within the Unity platform from the standard MNIST binary files; building on optimized edge details, the network also supports rendering high-resolution images at any given dimension, providing a foundational framework for creating larger visual scenes. Inspired by the LabelMe method, the second layer employs an automated annotation technique to compute label information for the rendered images, including gradients and rectangle vertices scaled to the dimensions of the generated images, ensuring a precise mapping between image data and its associated labels. The final layer introduces an abutting grating generation algorithm that produces visual illusion images with adjustable sparsity levels, allowing customization for specific requirements or experimental conditions. In the grating generation process, the core logic uses modular arithmetic to determine the stripe width and color at each pixel position, as sketched below.
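The following is a minimal sketch of the modular-arithmetic stripe logic and the rectangle-vertex labeling described above, assuming horizontal stripes and a half-period phase shift inside the foreground; the names and defaults (`abutting_grating`, `period`, `duty`) are illustrative assumptions, not the paper’s Unity implementation.

```python
import numpy as np

def abutting_grating(mask, period=8, duty=4, phase_shift=None):
    """Render an abutting-line-grating image from a binary foreground mask.

    Stripes cover the whole canvas, but inside the foreground their phase is
    shifted by half a period, so the figure is delineated only by offset line
    endings (an illusory contour), never by a luminance edge. Increasing
    `period` relative to `duty` thins the stripes and raises the sparsity.
    """
    if phase_shift is None:
        phase_shift = period // 2
    h, w = mask.shape
    rows = np.arange(h)[:, None]                 # row index of every pixel
    bg = (rows % period) < duty                  # background stripe pattern
    fg = ((rows + phase_shift) % period) < duty  # phase-shifted foreground
    img = np.where(mask.astype(bool), fg, bg)
    return (img * 255).astype(np.uint8)

def rect_label(mask):
    """LabelMe-style rectangle vertices (x_min, y_min, x_max, y_max) of the
    foreground, in the coordinates of the generated image."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

# Usage with a stand-in mask (a real run would use a rendered MNIST digit):
mask = np.zeros((28, 28), np.uint8)
mask[6:22, 10:18] = 1
image, label = abutting_grating(mask), rect_label(mask)
```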
Conclusion
Model selection received thorough consideration: ten representative deep neural network models were chosen from the current cutting edge of deep learning. These include traditional convolutional neural networks such as AlexNet, VGG11, ResNet18, and DenseNet121; state-of-the-art object detection and segmentation models such as YOLOv8 and Mask R-CNN; and attention-based Transformer models such as the Vision Transformer (ViT), the Mobile Vision Transformer (MobileViT), the Swin Transformer, and the Segment Anything Model (SAM). Through response analysis, we investigate the mechanisms underlying analyzable boundary issues in robustness and generalization. The DNNs demonstrate superior accuracy and reliability on similar images, composed of known foreground and background patterns; on dissimilar scene images containing unknown patterns, however, their performance declines sharply. This decline is attributable purely to errors or misclassifications triggered by changes in foreground or background patterns: it occurs even in dissimilar images whose foreground patterns are unchanged. These results intuitively demonstrate that unknown boundaries exist within the working capability of DNNs. The boundaries highlight the challenges DNNs face when generalizing to dissimilar situations and reflect our limited comprehension of where the working capability boundary lies.
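For intuition about the probing pattern, here is a minimal sketch that loads pretrained versions of a few of the architectures named above and records their top-1 responses; in practice the same loop would be run twice, once on similar (known-pattern) batches and once on illusion batches, and the responses compared. The random stand-in batch and the off-the-shelf ImageNet weights are assumptions for illustration only, not the study’s actual training and evaluation protocol.

```python
import torch
from torchvision import models

# Stand-in batch: in practice these would be CRVIS-generated scene images,
# resized to 224x224 and ImageNet-normalized before inference.
batch = torch.rand(4, 3, 224, 224)

# Three of the ten architectures named in the study, with ImageNet weights
# (an assumption here; the paper's own checkpoints may differ).
nets = {
    "alexnet":  models.alexnet(weights=models.AlexNet_Weights.DEFAULT),
    "resnet18": models.resnet18(weights=models.ResNet18_Weights.DEFAULT),
    "vit_b_16": models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT),
}

for name, net in nets.items():
    net.eval()
    with torch.no_grad():
        top1 = net(batch).argmax(dim=1)   # top-1 class per image
    # Comparing top-1 responses between known-pattern and illusion batches
    # exposes the sharp accuracy drop described above.
    print(name, top1.tolist())
```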
Links
Li, T., Lyu, R. and Xie, Z., 2024. Pattern memory cannot be completely and truly realized in deep neural networks. Scientific Reports, 14(1), p.31649.