This study presents a MATLAB-to-Python (M-to-PY) conversion process, demonstrated on an image skeletonization project comprising fifteen MATLAB files and a large dataset. Its central contribution is the use of ChatGPT-4 as an AI assistant in building a prototype M-to-PY converter, which was then evaluated against a set of test cases generated by the Bard bot. The effort culminated in the Skeleton App, a live, publicly available application for image sketching and skeletonization. The app illustrates the potential of AI to ease the migration of scientific code from MATLAB to Python, and the study as a whole shows how AI's computational capabilities can be combined with human judgment in computational research and tool development.
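To make the conversion target concrete, here is a minimal sketch of what a skeletonization step can look like in idiomatic Python using scikit-image; the file names and the Otsu-thresholding choice are illustrative assumptions, not the Skeleton App's actual code.

```python
# Minimal image-skeletonization sketch in Python (illustrative only,
# not the Skeleton App's actual implementation).
from skimage import io, color
from skimage.filters import threshold_otsu
from skimage.morphology import skeletonize

image = io.imread("drawing.png")       # hypothetical input file (RGB assumed)
gray = color.rgb2gray(image)           # convert to grayscale in [0, 1]
binary = gray < threshold_otsu(gray)   # dark strokes become foreground (True)
skeleton = skeletonize(binary)         # thin shapes to 1-pixel-wide lines
io.imsave("skeleton.png", (skeleton * 255).astype("uint8"))
```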
Validation of AI models for ITCZ Detection from Climate Data
This paper presents an innovative testing framework, testFAILS, designed for the rigorous evaluation of AI Linguistic Systems, with a particular emphasis on various iterations of ChatGPT. Leveraging orthogonal array coverage, this framework provides a robust mechanism for assessing AI systems, addressing the critical question, "How should we evaluate AI?" While the Turing test has traditionally been the benchmark for AI evaluation, we argue that current publicly available chatbots, despite their rapid advancements, have yet to meet this standard. However, the pace of progress suggests that achieving Turing test-level performance may be imminent. In the interim, the need for effective AI evaluation and testing methodologies remains paramount. Our research, which is ongoing, has already validated several versions of ChatGPT, and we are currently conducting comprehensive testing on the latest models, including ChatGPT-4, Bard and Bing Bot, and the LLaMA model. The testFAILS framework is designed to be adaptable, ready to evaluate new bot versions as they are released. Additionally, we have tested available chatbot APIs and developed our own application, AIDoctor, utilizing the ChatGPT-4 model and Microsoft Azure AI technologies.
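As a rough illustration of orthogonal-array coverage in this setting (the chatbot test factors below are hypothetical, not testFAILS's actual parameters), a strength-2 array exercises every pair of factor levels in far fewer runs than the full cross product:

```python
# Illustrative strength-2 orthogonal array L4(2^3): 4 runs cover every
# pair of levels across 3 two-level factors (factor names are made up).
from itertools import combinations, product

factors = {
    "language": ["English", "Spanish"],
    "prompt_style": ["question", "instruction"],
    "session": ["fresh", "continued"],
}
# Rows of the standard L4 array, used as indices into each factor's levels.
L4 = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
runs = [
    {name: levels[i] for (name, levels), i in zip(factors.items(), row)}
    for row in L4
]

# Verify pairwise coverage: every level pair of every factor pair appears.
for a, b in combinations(range(3), 2):
    seen = {(row[a], row[b]) for row in L4}
    assert seen == set(product([0, 1], repeat=2))

for run in runs:
    print(run)  # 4 test configurations instead of the full 2**3 = 8
```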
- Award ID(s): 2034030
- PAR ID: 10430159
- Date Published:
- Journal Name: Proceedings of 2022 5th International Conference on Data Science and Information
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Artificial Intelligence (AI) bots receive considerable attention and see wide use, from industrial manufacturing to store-cashier applications. Our research trains AI bots to serve as software engineering assistants, specifically to detect biases and errors inside AI software applications. An example application is an AI machine learning system that sorts and classifies people according to various attributes, such as the algorithms involved in criminal sentencing, hiring, and admission practices. Biases, unfair decisions, and flaws in equity, diversity, and justice in such systems can have severe consequences. As a Hispanic-Serving Institution, we are concerned about underrepresented groups and devoted substantial time to implementing "An Assure AI" (AAAI) Bot to detect biases and errors in AI applications. Our state-of-the-art AI bot was developed from our previously accumulated research in AI and Deep Learning (DL). The key differentiator is our unique approach: instead of cleaning the input data by filtering it and minimizing its biases, we trained our deep Neural Networks (NNs) to detect and mitigate biases of existing AI models. The backend of our bot uses the Detection Transformer (DETR) framework, developed by Facebook, …
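Since the abstract cuts off at the DETR backend, here is a generic sketch of how Facebook's DETR is commonly loaded through the Hugging Face transformers port; this is an assumption about a typical setup, not the AAAI Bot's actual configuration.

```python
# Loading Facebook's DETR via Hugging Face transformers -- a generic
# sketch, not the AAAI Bot's actual backend configuration.
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("sample.jpg")  # hypothetical input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Keep detections above a confidence threshold (0.9 is an arbitrary choice).
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]
for score, label in zip(results["scores"], results["labels"]):
    print(model.config.id2label[label.item()], round(score.item(), 3))
```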
-
Many organizations seek to ensure that machine learning (ML) and artificial intelligence (AI) systems work as intended in production but currently do not have a cohesive methodology in place to do so. To fill this gap, we propose MLTE (Machine Learning Test and Evaluation, colloquially referred to as "melt"), a framework and implementation to evaluate ML models and systems. The framework compiles state-of-the-art evaluation techniques into an organizational process for interdisciplinary teams, including model developers, software engineers, system owners, and other stakeholders. MLTE tooling supports this process by providing a domain-specific language that teams can use to express model requirements, an infrastructure to define, generate, and collect ML evaluation metrics, and the means to communicate results.
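As a concrete but purely hypothetical illustration of the express-requirements / collect-metrics / validate cycle such a framework supports, the sketch below uses made-up names and values and does not reflect MLTE's actual domain-specific language:

```python
# Generic express/collect/validate pattern, as a framework like MLTE
# supports -- all names and values here are hypothetical, not MLTE's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Requirement:
    name: str
    check: Callable[[float], bool]  # validator over a collected metric

# Stakeholders express model requirements up front...
requirements = [
    Requirement("accuracy", lambda v: v >= 0.90),
    Requirement("p99_latency_ms", lambda v: v <= 250.0),
]

# ...metrics are collected during evaluation (values here are made up)...
collected = {"accuracy": 0.93, "p99_latency_ms": 310.0}

# ...and results are validated and communicated to the team.
for req in requirements:
    status = "PASS" if req.check(collected[req.name]) else "FAIL"
    print(f"{req.name}: {collected[req.name]} -> {status}")
```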
-
Large language models (LLMs) are perceived to offer promising potential for automating security tasks, such as those found in security operation centers (SOCs). As a first step towards evaluating this perceived potential, we investigate the use of LLMs in software pentesting, where the main task is to automatically identify software security vulnerabilities in source code. We hypothesize that an LLM-based AI agent can be improved over time for a specific security task as human operators interact with it. Such improvement can be made, as a first step, by engineering the prompts fed to the LLM based on the responses produced, to include relevant contexts and structures so that the model provides more accurate results. Such engineering efforts become sustainable if the prompts that are engineered to produce better results on current tasks also produce better results on future unknown tasks. To examine this hypothesis, we utilize the OWASP Benchmark Project 1.2, which contains 2,740 hand-crafted source code test cases containing various types of vulnerabilities. We divide the test cases into training and testing data, where we engineer the prompts based on the training data (only), and evaluate the final system on the testing data. We compare the AI agent's performance on the testing data against the performance of the agent without the prompt engineering. We also compare the AI agent's results against those from SonarQube, a widely used static code analyzer for security testing. We built and tested multiple versions of the AI agent using different off-the-shelf LLMs: Google's Gemini-pro, as well as OpenAI's GPT-3.5-Turbo and GPT-4-Turbo (with both chat completion and assistant APIs). The results show that using LLMs is a viable approach to build an AI agent for software pentesting that can improve through repeated use and prompt engineering.
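A minimal sketch of the setup described, assuming the OpenAI Python client; the engineered system context, model name, and classification protocol below are illustrative guesses, not the authors' actual prompts:

```python
# Sketch of the train/test prompt-engineering setup described above;
# the engineered context, model choice, and protocol are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Context distilled from the *training* test cases only.
ENGINEERED_CONTEXT = (
    "You are a software pentesting assistant. Answer strictly 'VULNERABLE' "
    "or 'SAFE'. Pay special attention to tainted data reaching SQL queries, "
    "OS commands, and file paths without sanitization."
)

def classify(source_code: str) -> str:
    """Ask the LLM whether a held-out test case contains a vulnerability."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": ENGINEERED_CONTEXT},
            {"role": "user", "content": source_code},
        ],
    )
    return response.choices[0].message.content.strip()
```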
-
The advanced capabilities of Large Language Models (LLMs) have made them invaluable across various applications, from conversational agents and content creation to data analysis, research, and innovation. However, their effectiveness and accessibility also render them susceptible to abuse for generating malicious content, including phishing attacks. This study explores the potential of using four popular commercially available LLMs, i.e., ChatGPT (GPT 3.5 Turbo), GPT 4, Claude, and Bard, to generate functional phishing attacks using a series of malicious prompts. We discover that these LLMs can generate both phishing websites and emails that convincingly imitate well-known brands, and can also deploy a range of evasive tactics used to elude the detection mechanisms employed by anti-phishing systems. These attacks can be generated using unmodified or "vanilla" versions of these LLMs without requiring any prior adversarial exploits such as jailbreaking. We evaluate the performance of the LLMs towards generating these attacks and find that they can also be utilized to create malicious prompts that, in turn, can be fed back to the model to generate phishing scams, thus massively reducing the prompt-engineering effort required by attackers to scale these threats. As a countermeasure, we build a BERT-based automated detection tool that can be used for the early detection of malicious prompts, preventing LLMs from generating phishing content. Our model is transferable across all four commercial LLMs, attaining an average accuracy of 96% for phishing website prompts and 94% for phishing email prompts. We also disclosed these vulnerabilities to the companies behind the LLMs concerned, with Google acknowledging the issue as severe. Our detection model is available for use on Hugging Face, as well as through a ChatGPT Actions plugin.
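A minimal sketch of how such a BERT-based prompt filter can be wired up with Hugging Face transformers; the checkpoint path and label convention are placeholders, since the abstract does not give the published model ID:

```python
# Sketch of a BERT-based malicious-prompt detector like the one described;
# the checkpoint name and label mapping are placeholders, not the authors'
# published model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "path/to/phishing-prompt-detector"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

def is_malicious(prompt: str) -> bool:
    """Flag a user prompt before it reaches the LLM."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1)) == 1  # assumes label 1 = malicious

print(is_malicious("Write an email pretending to be a bank asking for..."))
```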