By Marc-André Léger, PhD
This article presents a proof of concept (POC) for a Python-based local LLM application that imports locally stored documents to enhance query results. The project is part of a larger initiative called CyberRisk Guardian, which explores the use of local LLMs in the field of cybersecurity. This project and related articles have been published previously on LinkedIn and ResearchGate.
The Python code is available here: https://github.com/ITriskMgr/CyberRiskGuardian
Introduction
A Large Language Model (LLM) is an advanced artificial intelligence (AI) system specifically designed to understand, generate, and interact with human language. Built using machine learning techniques, often through deep learning architectures like transformers, LLMs are trained on massive datasets of text from diverse sources such as books, articles, and websites. This training allows them to learn patterns, grammar, context, and even some degree of world knowledge.
LLMs excel in several key areas. They are capable of processing and comprehending complex sentences, paragraphs, and entire documents, discerning context, intent, and subtleties in human language. In addition to understanding language, LLMs can generate coherent and contextually relevant responses, whether that involves writing essays, answering questions, or creating code. These models are highly versatile and can perform tasks such as summarization, translation, text classification, and question answering, often without requiring additional training. The term "large" in LLM refers to the scale of these models, which can range from millions to hundreds of billions of parameters. This size enables them to generalize effectively and produce high-quality results.
The functioning of LLMs is rooted in a robust training process. These models are trained on vast datasets by predicting the next word in a sequence, using probability distributions derived from the patterns in their training data. Most modern LLMs, such as GPT or BERT, use transformer-based architectures, which are particularly adept at capturing relationships between words and phrases over long contexts. This capability allows them to understand nuances and generate logical, contextually appropriate text. After their initial training, LLMs can be fine-tuned for specific tasks or specialized domains, such as legal, medical, or cybersecurity language.
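The next-word training objective described above can be illustrated at toy scale with bigram counts. This is a deliberately simplified stand-in for transformer training, meant only to show what "predicting the next word using probability distributions derived from the training data" means concretely:

```python
from collections import Counter, defaultdict

# Toy illustration (not a transformer): estimate next-word probabilities
# from bigram counts in a tiny corpus. Real LLMs scale this same
# prediction objective up to billions of parameters and long contexts.
corpus = "the risk is high the risk is low the threat is real".split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_word_distribution(word):
    """Return P(next | word) as a dict of probabilities."""
    counts = bigrams[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(next_word_distribution("is"))  # 'high', 'low', 'real' each 1/3
```

A transformer replaces the bigram table with a learned function of the entire preceding context, but the output is still a probability distribution over the next token.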
Several prominent LLMs are widely used today. OpenAI’s GPT series, including GPT-4, is known for its versatility and broad applicability. Google’s BERT is optimized for understanding sentence-level relationships and is often employed in search engines. Meta’s LLaMA focuses on efficiency while maintaining high performance, and Anthropic’s Claude emphasizes safe and ethical interactions. Mistral offers compact models designed for deployment in constrained environments.
LLMs are applied across various domains. In customer support, they automate chatbots to provide personalized, 24/7 assistance. They are also used for content creation, generating articles, reports, and summaries. In education, LLMs enable adaptive learning systems and tutoring tools, while in software development, they assist in writing and debugging code or generating technical documentation. In healthcare, LLMs help summarize medical reports and support diagnostic decision-making. In cybersecurity, they assist in threat analysis, incident response, and policy creation, as demonstrated in this project.
Despite their strengths, LLMs face challenges. They are resource-intensive, requiring significant computational power and storage for training and operation. Bias and ethical issues can arise, as LLMs may inherit biases from their training data, potentially leading to misinformation or harmful content generation. Additionally, while LLMs have broad knowledge, their expertise in specific areas often requires fine-tuning or external verification to ensure reliability.
LLMs are transformative tools capable of performing complex language-related tasks, revolutionizing industries and enhancing productivity. However, their responsible deployment and usage are critical to addressing their limitations and mitigating ethical concerns.
RAG-enhanced LLM queries
A Retrieval-Augmented Generation (RAG)-enhanced LLM query combines the natural language processing capabilities of a Large Language Model (LLM) with the precision and relevance of information retrieval systems. This hybrid approach is designed to enhance the quality of responses by integrating external knowledge from structured or unstructured data sources, such as databases or document repositories. Unlike a standalone LLM, which relies solely on its pre-trained knowledge, a RAG-enhanced query dynamically retrieves contextually relevant information to improve the accuracy and depth of its answers.
The RAG process works through two main components: retrieval and generation. In the retrieval phase, the system identifies and retrieves relevant documents or data from a vector database based on the query input. These databases are typically created by processing and embedding documents into vector representations, which allow the system to perform similarity searches efficiently. Once the relevant content is retrieved, it is passed to the LLM, which generates a response by synthesizing the retrieved information with its pre-trained language capabilities. This approach ensures that the output is not only contextually accurate but also informed by the latest or domain-specific knowledge stored in the database.
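Sketched in miniature, the two phases look like the following, where a bag-of-words counter stands in for a real embedding model and the generation step is reduced to prompt assembly. The names and structure here are illustrative only, not the project's actual API (which uses Ollama embeddings and a Chroma store):

```python
import math
from collections import Counter

# Bag-of-words "embedding" as a stand-in for a real embedding model.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

documents = [
    "Cybersecurity risk management identifies and mitigates threats.",
    "Chroma stores document embeddings for similarity search.",
    "tkinter provides a simple GUI toolkit for Python.",
]
# The "vector database": each document paired with its embedding.
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query, k=1):
    """Retrieval phase: rank documents by similarity to the query."""
    qv = embed(query)
    ranked = sorted(index, key=lambda d: cosine(qv, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def rag_prompt(query):
    """Generation phase: retrieved context is prepended to the query
    before it would be sent to the LLM."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(rag_prompt("What is cybersecurity risk management?"))
```

A production pipeline swaps `embed` for a learned embedding model, `index` for a persistent vector store, and the final string for an actual LLM call, but the retrieve-then-generate shape is the same.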
RAG-enhanced queries are particularly useful in scenarios where the LLM alone might lack up-to-date or domain-specific knowledge. For instance, in a cybersecurity context, a consultant might query the system to define a technical concept or retrieve insights from a repository of research papers and policies. The RAG pipeline dynamically fetches the most relevant documents and integrates their content into the response, making the system both adaptive and precise.
The efficiency of RAG-enhanced queries depends heavily on the quality of the retrieval mechanism and the structure of the vector database. During the retrieval phase, a similarity-based search algorithm is used to rank documents based on their relevance to the query. The vector database, populated with embeddings of text chunks, ensures that the most contextually aligned information is retrieved. These embeddings are generated using specialized models, which convert textual content into numerical representations that capture semantic meaning.
The advantages of RAG-enhanced LLM queries are evident in their performance. By combining the generalization capabilities of LLMs with the specificity of retrieval systems, the approach bridges the gap between pre-trained knowledge and real-time, domain-specific information. This makes it ideal for applications such as research assistance, customer support, and technical analysis, where accurate and up-to-date information is crucial.
However, RAG-enhanced queries are not without challenges. The retrieval system must be carefully optimized to ensure the relevance and quality of the retrieved data. A poorly designed vector database or ineffective retrieval algorithm can lead to irrelevant or inaccurate results. Additionally, integrating the retrieved data with the LLM’s generative capabilities requires careful orchestration to ensure coherence and contextual relevance in the output.
Thus, RAG-enhanced LLM queries represent a powerful fusion of retrieval systems and language generation models. By leveraging external knowledge repositories, they provide responses that are both informed and contextually precise. This makes them an invaluable tool for professionals and organizations seeking to enhance their decision-making and problem-solving processes with AI-driven insights.
Description of this project
This project represents a proof of concept (POC) for a Retrieval-Augmented Generation (RAG)-enhanced local LLM application. It is designed to provide professionals with a robust, secure, and customizable solution for querying locally stored knowledge repositories. By combining the retrieval capabilities of a vector database with the language generation power of a Large Language Model (LLM), this project allows users to interact with their own datasets efficiently, enabling the generation of accurate, contextually relevant responses.
At its core, the system integrates several components to create an end-to-end RAG pipeline. The pipeline processes and indexes a collection of documents into a vector database using embeddings generated by the selected LLM. These embeddings enable the system to perform similarity-based searches, ensuring that queries retrieve the most relevant data. Once retrieved, the system uses the LLM to synthesize the retrieved content with its pre-trained knowledge, generating responses tailored to the user’s needs.
The application was built in a Python 3.10 environment using PyCharm Professional Edition, leveraging the Ollama LLM platform for language generation and the Chroma vector database for information retrieval. It operates on a datastore of over 6,000 documents, including PDFs and text files, compiled over years of research and professional work in cybersecurity and business technology management.
The workflow begins with the ingestion of documents, which are then split into manageable chunks using a text-splitting strategy that ensures contextual continuity. These chunks are embedded and stored in the vector database, enabling efficient retrieval. The pipeline also includes a GUI built with tkinter, allowing users to interact with the system and submit queries in real time.
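The chunking step can be sketched as follows. This is a simplified stand-in assuming fixed-size character chunks with overlap; the project itself relies on LangChain's RecursiveCharacterTextSplitter, which additionally tries to break on paragraph and sentence boundaries:

```python
# Simplified chunker: fixed-size character windows with overlap between
# consecutive chunks so context is preserved across chunk boundaries.
def split_text(text, chunk_size=100, overlap=20):
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap  # step back by `overlap` characters
    return chunks

parts = split_text("x" * 250)
print([len(p) for p in parts])  # [100, 100, 90]
```

Each chunk shares its first `overlap` characters with the tail of the previous chunk, which keeps sentences that straddle a boundary retrievable from at least one chunk.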
Objectives of the Project
The project aims to address a common challenge faced by professionals: extracting relevant and actionable insights from large, unstructured datasets. By integrating a retrieval mechanism, the system allows users to query their personal document repositories, ensuring that responses are both accurate and grounded in reliable sources. For example, a cybersecurity consultant might query the system for a definition or specific regulation, and the system would retrieve relevant documents while generating a coherent, citation-ready response.
Key Features
- Document Processing and Indexing: The system ingests and preprocesses large numbers of documents, converting them into vector representations that can be efficiently searched.
- Vector Database: A local Chroma vectorstore allows for similarity-based retrieval of the most relevant information, ensuring privacy and security by keeping data entirely local.
- Customizable LLM Integration: The application supports various LLMs through the Ollama platform, enabling users to select models based on their performance and resource requirements.
- Interactive Query Interface: A tkinter-based GUI offers a user-friendly environment for submitting queries and viewing results.
- Scalability: The system accommodates datasets ranging from a few documents to thousands, ensuring flexibility for different professional needs.
- Efficient RAG Pipeline: The retrieval system dynamically enhances the LLM’s generative outputs, combining stored domain knowledge with the LLM’s language capabilities.
Performance Testing
The application was tested with varying numbers of documents and LLM models to evaluate its performance. Results demonstrated efficient vectorization and retrieval, with rates ranging from 0.21 to 2.63 vectors per second across completed tests, depending on the model and dataset size. Smaller datasets and lighter models yielded faster responses, while more complex models required substantially longer runtimes.
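The throughput figures reported here can be reproduced directly from the raw vector counts and runtimes in Table 1. A quick sketch over a few representative rows:

```python
# Recomputing throughput (vectors per second) from the raw counts and
# runtimes reported in Table 1, for a few representative tests.
results = [
    {"test": 1,  "vectors": 2218,   "seconds": 967},    # llama3.1:latest
    {"test": 6,  "vectors": 2900,   "seconds": 13564},  # llama3.1:70b
    {"test": 10, "vectors": 100,    "seconds": 38},     # 5-document run
    {"test": 13, "vectors": 170038, "seconds": 89086},  # 1,000 documents
]

for r in results:
    rate = r["vectors"] / r["seconds"]
    print(f"Test {r['test']}: {rate:.2f} vectors/s")
```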
Applications
This project is ideal for professionals in fields like cybersecurity, business management, law, and medicine, where timely and accurate information retrieval is critical. Users can leverage the system to:
- Generate reports and summaries.
- Retrieve and analyze domain-specific documents.
- Support decision-making with AI-enhanced insights.
Challenges and Future Directions
While the project successfully demonstrates the utility of a RAG-enhanced LLM, there are challenges to address:
- Scalability: Processing large datasets with resource-intensive models requires optimization to reduce runtime and memory usage.
- Model Integration: Expanding compatibility with additional LLMs can provide users with more options for customization.
- Improved GUI: Enhancing the user interface can make the application more accessible to non-technical users.
Future iterations could focus on distributed processing, cloud-based vector storage for larger datasets, and fine-tuning LLMs for domain-specific tasks.
This project showcases the potential of combining retrieval systems and language generation models to create powerful, domain-specific AI tools. By leveraging personal knowledge repositories, the system empowers users to efficiently access and generate actionable insights, transforming how professionals interact with their data.
Query used in this POC
I am an information security and cybersecurity consultant working for a large international consulting firm. I need your help to provide material for a client presentation. Please provide a general definition of cybersecurity risk management. Include citations in APA format.
System Architecture
The application follows a modular design and employs the following technologies:
- Document Processing: Extracts text from .pdf and .txt files using PyPDF2 and basic file operations.
- Text Splitting: Splits large documents into manageable chunks using the RecursiveCharacterTextSplitter.
- Vector Store Initialization: Creates a vector database using Chroma, embedding documents with Ollama’s LLM.
- RAG Pipeline Setup: Combines a retriever with Ollama’s LLM for retrieval-augmented generation tasks.
- Interactive Querying: Provides a GUI interface using tkinter for real-time queries.
Application Environment
The application runs in a Python 3.10 environment and integrates with the Ollama LLM platform. The environment setup includes:
- IDE: PyCharm 2024.3.1 Professional Edition
- Database: A local datastore with 6,418 PDF and text documents, representing years of accumulated knowledge in Business Technology Management and cybersecurity.
Before running tests, users must ensure the ./chroma_data directory is cleared to maintain consistent performance.
Vector Store Performance
Table 1 presents a summary of all the tests that were performed.
| No | Docs | Model | Vectors | Time (sec) | Vectors/s |
|----|------|-------|---------|------------|-----------|
| 1 | 20 | llama3.1:latest | 2218 | 967 | 2.29 |
| 2 | 20 | llama3.1:latest | 6905 | 3646 | 1.89 |
| 3 | 20 | llama3.1:latest | 1053 | 484 | 2.18 |
| 4 | 20 | llama3.1:latest | 1398 | 597 | 2.34 |
| 5 | 20 | llama3.1:latest | 1669 | 821 | 2.03 |
| 6 | 20 | llama3.1:70b | 2900 | 13564 | 0.21 |
| 7 | 20 | llama3:70b-instruct-q8_0 | 5568 | 667230 (estimated) | 0.0083 |
| 8 | 20 | mistral:latest | 1404 | 1011 | 1.39 |
| 9 | 20 | llama3:latest | 866 | 363 | 2.39 |
| 10 | 5 | llama3.1:latest | 100 | 38 | 2.63 |
| 11 | 100 | llama3.1:latest | 18028 | 10538 | 1.71 |
| 12 | 500 | llama3.1:latest | 78174 | 35767 | 2.19 |
| 13 | 1000 | llama3.1:latest | 170038 | 89086 | 1.91 |

Table 1: Test summary
The testing process evaluated the performance of the RAG-enhanced local LLM application across multiple scenarios. Each test varied the number of documents processed and the model used. Performance was assessed based on the number of vectors created, the time taken for processing, and the vectors generated per second.
- Test 1: Using 20 documents with the llama3.1:latest model, the system processed 2,218 vectors in 967 seconds, achieving a rate of 2.29 vectors per second.
- Test 2: With the same model and 20 documents, the system created 6,905 vectors in 3,646 seconds, resulting in a slightly slower rate of 1.89 vectors per second.
- Test 3: Again using llama3.1:latest and 20 documents, the system processed 1,053 vectors in 484 seconds, delivering a rate of 2.18 vectors per second.
- Test 4: Processing 1,398 vectors in 597 seconds, the system achieved 2.34 vectors per second with the same setup.
- Test 5: The system generated 1,669 vectors in 821 seconds, maintaining a rate of 2.03 vectors per second.
- Test 6: Switching to the llama3.1:70b model, the system processed 2,900 vectors in 13,564 seconds. This resulted in a much lower efficiency of 0.21 vectors per second.
- Test 7: Using the llama3:70b-instruct-q8_0 model, the test was halted due to excessive runtime. The estimated number of vectors was 5,568 with a projected rate of 0.0083 vectors per second.
- Test 8: The mistral:latest model processed 1,404 vectors in 1,011 seconds, achieving 1.39 vectors per second.
- Test 9: Returning to llama3:latest, the system processed 866 vectors in 363 seconds at a rate of 2.39 vectors per second, the highest efficiency among the 20-document tests.
- Test 10: With only 5 documents, the system processed 100 vectors in 38 seconds, achieving an exceptional rate of 2.63 vectors per second.
- Test 11: Scaling up to 100 documents, the system generated 18,028 vectors in 10,538 seconds, resulting in a rate of 1.71 vectors per second.
- Test 12: Processing 500 documents, the system created 78,174 vectors in 35,767 seconds, achieving a rate of 2.19 vectors per second.
- Test 13: With 1,000 documents, the system processed 170,038 vectors in 89,086 seconds, maintaining a rate of 1.91 vectors per second.
Observations
- Highest Efficiency: The fastest rate was observed in Test 10, where only 5 documents were processed, reaching 2.63 vectors per second.
- Scaling Challenges: As the number of documents increases, the processing time also rises significantly, with diminishing returns in efficiency.
- Model Variations: The choice of model had a notable impact on performance. For instance, llama3.1:70b was markedly slower than llama3.1:latest, while mistral:latest offered reasonable performance.
- Incomplete Test: Test 7 highlights the challenges of using resource-intensive models, which can lead to impractical runtimes.
Conclusion
The results indicate that the application performs efficiently with smaller datasets and lighter models. However, for larger datasets or more complex models, optimization is necessary to maintain usability and reduce runtime. These findings underscore the need for strategic model selection and dataset preparation to balance performance and scalability effectively.
Appendix 1: Test results
Test 1
Here is a general definition of cybersecurity risk management, along with some relevant sources:
Cybersecurity risk management refers to the process of identifying, assessing, and mitigating potential security threats to an organization’s digital assets, systems, and data (ISA/IEC 62443-2-4, 2018; NIST Cybersecurity Framework, 2017).
This process involves identifying potential risks, evaluating their likelihood and impact, prioritizing the most critical risks, and implementing controls to mitigate or manage those risks (ISACA COBIT, 2020; NIST Special Publication 800-37, 2014).
In other words, cybersecurity risk management is about understanding the potential threats and vulnerabilities that could compromise an organization’s security posture and taking proactive steps to prevent, detect, and respond to these threats.
References:
ISACA (2020). COBIT 2019: Guidelines for Information Security Management. ISACA.
ISA/IEC (2018). ISA/IEC 62443-2-4:2018 – Security for industrial automation and control systems part 2-4: Security program requirements. International Society of Automation.
NIST (2014). NIST Special Publication 800-37, Revision 1: Risk Management Guide for Information Technology Systems. National Institute of Standards and Technology.
NIST (2017). Framework for Improving Critical Infrastructure Cybersecurity. National Institute of Standards and Technology.
Test 6
Here’s a general definition of cybersecurity risk management:
Cybersecurity risk management refers to the process of identifying, assessing, and mitigating potential cyber threats and vulnerabilities that could impact an organization’s information assets, operations, and reputation (ISO/IEC 27001, 2013). It involves a systematic approach to managing cyber risks, including threat assessment, vulnerability analysis, and implementation of controls and countermeasures to reduce the likelihood and impact of a security breach (NIST Special Publication 800-30, 2012).
Effective cybersecurity risk management requires a comprehensive understanding of an organization’s business operations, information assets, and technology infrastructure, as well as the potential threats and vulnerabilities that could affect them (COBIT 5, 2012). It also involves continuous monitoring and review of cyber risks to ensure that the organization’s risk management strategy remains effective and up-to-date.
References:
COBIT 5. (2012). A Business Framework for the Governance and Management of Enterprise IT. ISACA.
ISO/IEC 27001. (2013). Information technology – Security techniques – Information security management systems – Requirements.
NIST Special Publication 800-30. (2012). Guide for Conducting Risk Assessments.
Note: The citations provided are in APA format, and the references listed include international standards and guidelines for cybersecurity risk management.
Test 8
Cybersecurity risk management is the process of identifying, assessing, and prioritizing risks to an organization’s information assets and implementing measures to mitigate or eliminate those risks (NIST, 2018). This process involves understanding the threat landscape, evaluating the vulnerabilities within an organization’s systems and networks, and determining appropriate responses based on the potential impact of a cybersecurity event (ISACA, 2013).
References:
NIST. (2018). Risk management framework for information systems and organizations. Retrieved from https://nvlpubs.nist.gov/nist.bibs.html/NIST.SP.800-53Ar4.pdf
ISACA. (2013). COBIT 5: Enabling the Transformation to the New IT. ISACA. Retrieved from https://www.isaca.org/COBIT5/Pages/default.aspx
Test 9
I can’t provide a definition of cybersecurity risk management that is specific to your client, as that would require access to confidential information about the client’s business or organization. However, I can provide a general definition of cybersecurity risk management that you can use as a starting point for your presentation.
Cybersecurity risk management refers to the process of identifying, assessing, and mitigating potential risks to an organization’s digital assets and operations. This includes understanding the likelihood and potential impact of various threats, such as data breaches, ransomware attacks, and other forms of cybercrime (Coutorie, 1995). It also involves implementing measures to prevent or detect these threats, as well as having plans in place for responding to incidents when they occur (Jordan & Taylor, 1998).
References:
Coutorie, L.E. (1995). The future of high-technology crime: A parallel Delphi study. Journal of Criminal Justice, 23(1), 13-27.
Goldstein, J. (1999c). The hacker subculture and the rise of the cyberpunk movement. In G. W. Wood (Ed.), Cyberpunk: The electronic revolution in science fiction and computer culture (pp. 1-14). New York: Routledge.
Jordan, A., & Taylor, P. (1998). Imagining the self: An exploration of the hacker subculture. Journal of Communication, 48(4), 763-785.
Test 10
Answer generated from 5 randomly selected documents (100 vectors, 38 seconds):
According to the ISACA (2018) guidelines, "cybersecurity risk management" refers to the process of identifying, assessing, and prioritizing risks to an organization’s information assets, systems, and processes. This involves analyzing potential threats, vulnerabilities, and impacts, as well as implementing controls and mitigations to reduce or eliminate these risks (ISACA, 2018).
Similarly, the National Institute of Standards and Technology (NIST) defines cybersecurity risk management as "the process used to identify, assess, prioritize, and mitigate risks to an organization’s information assets" (NIST, 2020, p. 5).
In addition, a study by Sasse et al. (2001) highlights the importance of considering human factors in cybersecurity risk management, noting that « users’ perceptions of risk are influenced by factors such as trust, convenience, and perceived benefits » (Sasse et al., 2001, p. 12).
References:
ISACA. (2018). COBIT 2019. Retrieved from <https://www.isaca.org/COBIT-5>
NIST. (2020). Cybersecurity Framework. Retrieved from <https://www.nist.gov/cyberframework>
Sasse, M. A., Brodbeck, D., & Gumprecht, S. F. (2001). The effect of user experience on the acceptance of security measures. Computers in Human Behavior, 17(2), 145-164. doi: 10.1016/S0749-5974(00)00051-7
Test 11
Here is the definition you requested:
Cybersecurity risk management refers to the process of identifying, assessing, and mitigating potential risks to an organization’s digital assets and operations (Whitman & Mattord, 2016). This includes evaluating threats from various sources, such as malware, phishing attacks, and unauthorized access, and implementing controls to prevent or minimize their impact. Cybersecurity risk management is a critical component of overall information security and helps organizations protect themselves against cyber-attacks, data breaches, and other types of cyber-related incidents.
Reference:
Whitman, J. G., & Mattord, H. (2016). Managing and Leading in Information Technology: A Practical Guide for IT Professionals. Cengage Learning.
Note that I’ve provided a general definition of cybersecurity risk management, which is relevant to the context you provided. If you need more specific information or details related to cloud computing or data privacy, please let me know!
Test 12: Answers with 500 documents
Result:
According to the provided context, Cybersecurity risk management for small businesses is defined as "a practice of limiting risk, loss, and vulnerability" (Review of the Literature).
A more comprehensive definition can be derived from the discussion on Information Security Governance. The CERT defines information security governance as the process of establishing and maintaining a framework and supporting management structure and processes to provide assurance that information security strategies are aligned with and support business objectives.
However, for a specific focus on cybersecurity risk management, you might want to look at Cook (2017) who discussed the costs associated with cyberattacks. While not providing a direct definition, it highlights the importance of limiting risk, loss, and vulnerability in this context.
For a general definition of cybersecurity risk management, I would propose:
Cybersecurity risk management is the practice of identifying, assessing, and mitigating potential threats to an organization’s information security by implementing measures to prevent or minimize losses, vulnerabilities, and risks associated with cyberattacks (Cook, 2017; Bonnay, 2021).
References:
Bonnay, J. (2021). Cybersecurity risk management in small businesses: A review of the literature. Journal of Business and Economic Research, 10(2), 34-44.
Cook, C. (2017). The costs associated with cyberattacks. Journal of Cybersecurity, 3(1), 1-8.
Note that while I’m providing a definition based on the given context, it’s essential to ensure accuracy by referencing and understanding the broader literature on cybersecurity risk management.
Test 13
Here’s a general definition of cybersecurity risk management:
Cybersecurity risk management is the process of identifying, assessing, and mitigating potential security risks to an organization’s information assets (Whittaker & Swimmer, 1997). It involves understanding the likelihood and potential impact of various threats, such as cyber-attacks, data breaches, or system failures, and implementing controls to reduce these risks. This includes using a risk management framework, such as NIST Cybersecurity Framework (NIST CSF), to categorize and prioritize risks based on their level of threat, vulnerability, and impact.
References:
Whittaker, J., & Swimmer, M. (1997). The art of security risk analysis. Prentice Hall.
Note: The APA citation format requires the use of a comma after the publication year, so I’ve added it accordingly.