Neural Network-Based Exfiltration Schema Identification

Vetrick Aringga Dicktiony Racero; Agung Prasetya; Taufiq Agung Cahyono

doi:10.36378/jtos.v9i1.5475

Vetrick Aringga Dicktiony Racero, Universitas Bhinneka PGRI Tulungagung
Agung Prasetya, Universitas Bhinneka PGRI Tulungagung
Taufiq Agung Cahyono, Universitas Bhinneka PGRI Tulungagung

Keywords: Text-to-SQL, Schema Exfiltration, BERT, Neural Network, Deep Learning

Abstract

This study applies the BERT architecture to identify schema exfiltration attempts in a neural network-based Text-to-SQL system. It is motivated by the growing use of Large Language Models (LLMs) in Text-to-SQL systems, which creates a risk that the database schema leaks through user prompts. The research problem is how to use a deep learning model to reliably and adaptively identify prompt manipulations that attempt schema exfiltration. The study uses BERT as the primary encoder and a feedforward neural network as the classifier. The model distinguishes 20 classes: 19 schema exfiltration categories and 1 benign class. The dataset was built from several sources, including WikiSQL, DatabaseAnswers, and educational datasets, and then underwent tokenization, labelling, and normalization. The model achieves an accuracy of 0.9462, precision of 0.8425, recall of 0.7483, F1-score of 0.7926, and exact match accuracy of 0.7596. The results also show that the model identifies explicit prompts better than implicit ones such as role switching and prompt injection. The study concludes that the BERT-based approach performs well at identifying schema exfiltration in Text-to-SQL systems, although detecting complex manipulation patterns still needs improvement.
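As a reading aid for the metrics above, the following minimal Python sketch shows how an F1-score relates to precision and recall as their harmonic mean. The function name `f1_score` is a hypothetical helper, not taken from the paper, and the abstract does not state which averaging scheme (macro, micro, or weighted) the authors used across the 20 classes, so this is only a consistency check on the reported figures, not a reproduction of their evaluation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (hypothetical helper)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Values as reported in the abstract.
reported_precision = 0.8425
reported_recall = 0.7483

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.4f}")
```

Under this assumption, the harmonic mean of the reported precision and recall rounds to the reported F1-score of 0.7926.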
