Artificial Intelligence Interactive Lab
Responsible RAG Chatbot Evaluation Studio
Students build and evaluate a university-policy assistant that must answer only from provided policy snippets. The focus is not just making a chatbot work, but proving citation quality, refusal behavior, and hallucination controls.
Auto-start Worker Lab Runtime
When this page loads, it automatically calls the Cloudflare Worker runtime for this topic. Students can immediately test the online API in the browser, Postman, or n8n. The downloadable native Python version remains available for local execution and Wireshark loopback capture.
Live runtime status
Starting Worker runtime…
- Running endpoint diagnostics…
Native Python Testing Kit
Run the lab locally with Python only, then validate it using Postman requests, n8n automation, and Wireshark traffic evidence.
Learning outcomes
0% checked
Instructor toolkit
Roles
RAG engineer, policy owner, evaluator, student user, ethics reviewer.
Free tools
Python notebook, scikit-learn TF-IDF retrieval, optional Ollama, local markdown policy snippets.
Metrics
Citation precision, answer faithfulness, refusal accuracy, context relevance, escalation correctness.
Governance lens
Transparency, accountability, privacy, human oversight, audit logs.
Hands-on station board
Run in teams30 min
1. Knowledge-base design
Station 1Prepare policy snippets so retrieval can be audited.
30 min
1. Knowledge-base design
- Assign each policy paragraph a source ID.
- Chunk by rule boundaries rather than arbitrary length only.
- Add metadata: policy area, date, owner, audience.
- Create five questions that should be answerable and five that should be refused.
Evidence:Policy corpus sheet with source IDs and test question set.
45 min
2. Retrieval and citation build
Station 2Implement or simulate retrieval before generation.
45 min
2. Retrieval and citation build
- Retrieve top matching snippets for a question.
- Force the answer template to cite source IDs.
- Display retrieved context before final answer.
- Log cases where retrieved context is weak or conflicting.
Evidence:Working notebook/prototype or retrieval simulation log.
45 min
3. Hallucination challenge bench
Station 3Attack the chatbot with ambiguous, missing-policy, and adversarial questions.
45 min
3. Hallucination challenge bench
- Test unsupported questions and verify refusal behavior.
- Identify intrinsic hallucination: answer contradicts retrieved source.
- Identify extrinsic hallucination: answer adds facts not present in source.
- Improve the prompt/control policy and retest.
Evidence:Evaluation matrix with before/after results.
30 min
4. Responsible deployment review
Station 4Decide whether the assistant is safe enough for student-facing use.
30 min
4. Responsible deployment review
- Define allowed and forbidden use cases.
- Add escalation paths for high-stakes questions.
- Write a transparency notice for users.
- Present risk controls to a mock academic board.
Evidence:Responsible AI deployment memo.
Policy snippet mini-corpus
| Source ID | Policy area | Snippet | Risk if misanswered |
|---|---|---|---|
| P01 | Attendance | Students must satisfy minimum attendance requirements defined by the course policy. | Medium |
| P02 | Assessment | Final grades combine assignments, midterm, final exam, and lecturer-approved participation. | High |
| P03 | Appeal | Grade appeals require documented evidence and submission within the stated academic window. | High |
| P04 | Privacy | Student academic data must not be exposed to unauthorized parties. | Critical |
| P05 | Scope | The assistant cannot approve exceptions; it can only explain published policy and direct users to staff. | Critical |
Copy-ready lab assets
Minimal retrieval baseline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Fit snippets, rank top-k, then answer only from returned source IDs.Answer contract
If the provided context does not contain the answer, say: "I cannot answer from the provided policy sources." Always include Source IDs.Self-check quiz
1. What makes a RAG answer more trustworthy?
Trust depends on source-grounded claims, not style.
2. Which test question is best for evaluating refusal behavior?
Unsupported or policy-creating prompts reveal whether the assistant refuses safely.
Assessment rubric
RAG architecture
30%Pipeline shows corpus, retrieval, context, answer, and citation trace.
Evaluation depth
30%Tests include answerable, unanswerable, conflicting, and adversarial cases.
Responsible AI controls
25%Refusal, escalation, transparency, and logging are specified.
Prototype clarity
15%Code/simulation is reproducible and easy to explain.
Student deliverables
- ✓Policy mini-corpus
- ✓RAG prototype or simulation
- ✓Hallucination evaluation matrix
- ✓Responsible deployment memo
- ✓Demo script
Deep lab purpose
This lab is designed to be hands-on and critical, not a generic chatbot demo. Students build or simulate the full RAG workflow, then attack it with evaluation questions to prove whether answers are grounded in the provided sources.
Scenario
A university wants a student-facing assistant that answers academic-policy questions. The assistant must not invent policy, approve exceptions, reveal private data, or answer beyond the official source snippets.
Students must answer:
- Which source supports each answer?
- What should the system do when the policy is missing?
- Which answer is faithful, partially supported, or hallucinated?
- What controls are needed before deployment?
Native Python + tool testing requirement
Students must run the provided native Python RAG API, test supported and unsupported questions in Postman, import the n8n workflow to create an evaluation record, and optionally capture the API traffic in Wireshark. The lab is considered complete only when citation behavior, refusal behavior, and evaluation logs are demonstrated with real API responses.
Required student artifacts
- Policy mini-corpus with source IDs and metadata.
- Retrieval experiment or simulation showing top matched snippets.
- Answer template that includes citations and refusal behavior.
- Hallucination evaluation matrix with answerable, unanswerable, ambiguous, and adversarial prompts.
- Responsible deployment memo covering transparency, escalation, privacy, and human oversight.
- Demo script showing both a successful answer and a safe refusal.
Lab depth extension
For advanced classes, students implement a lightweight retrieval baseline using TF-IDF or embeddings. For non-coding classes, students simulate retrieval manually using the mini-corpus and focus on evaluation, governance, and risk control.
Instructor facilitation notes
Do not let students treat a confident chatbot response as correct. Require source IDs and claim-by-claim validation. If an answer contains one unsupported sentence, it must be marked as a failure or partial failure.
Challenge questions:
- When should the assistant refuse instead of answer?
- What is the difference between intrinsic and extrinsic hallucination?
- How can citations create a false sense of trust?
- Which questions should be escalated to staff?
- What should the user transparency notice say?
Assessment emphasis
Strong submissions demonstrate retrieval traceability, rigorous evaluation, and responsible AI controls. Weak submissions only show a chatbot interface without proving that the answer is grounded.