Interactive Deep Lab: Responsible RAG Chatbot

Artificial Intelligence Interactive Lab

Responsible RAG Chatbot Evaluation Studio

Students build and evaluate a university-policy assistant that must answer only from provided policy snippets. The focus is not just making a chatbot work, but proving citation quality, refusal behavior, and hallucination controls.

Duration: 150–180 minutes | Native Python runnable lab | Test with Wireshark / Postman / n8n

Auto-start Worker Lab Runtime

When this page loads, it automatically calls the Cloudflare Worker runtime for this topic. Students can immediately test the online API in the browser, Postman, or n8n. The downloadable native Python version remains available for local execution and Wireshark loopback capture.

Native Python Testing Kit

Run the lab locally with Python only, then validate it using Postman requests, n8n automation, and Wireshark traffic evidence.

Kit assets: README / run guide, Local Postman collection, Online Worker Postman, Local n8n workflow, Online Worker n8n, Wireshark filters.

Learning outcomes

Outcome 1: Explain the RAG pipeline: chunking, embeddings, retrieval, generation, citation, and evaluation.
Outcome 2: Build a minimal local or notebook-based retrieval workflow using policy snippets.
Outcome 3: Measure answer faithfulness, citation coverage, refusal quality, and unsupported-claim risk.
Outcome 4: Design human-in-the-loop controls for responsible AI deployment in academic services.

Instructor toolkit

Roles

RAG engineer, policy owner, evaluator, student user, ethics reviewer.

Free tools

Python notebook, scikit-learn TF-IDF retrieval, optional Ollama, local markdown policy snippets.

Metrics

Citation precision, answer faithfulness, refusal accuracy, context relevance, escalation correctness.

Governance lens

Transparency, accountability, privacy, human oversight, audit logs.
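
Two of the toolkit metrics above, citation precision and refusal accuracy, can be computed from a small table of evaluated questions. The sketch below is one possible formulation, not a definition supplied by the lab: the `EvalCase` layout, the field names, and the four example rows are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One evaluated question (hypothetical record layout)."""
    question: str
    should_refuse: bool   # expected behavior from the test set
    did_refuse: bool      # behavior the assistant actually showed
    cited: set = field(default_factory=set)       # source IDs the answer cited
    supporting: set = field(default_factory=set)  # source IDs a reviewer accepted


def citation_precision(cases):
    """Share of cited source IDs that a reviewer accepted as supporting."""
    cited = sum(len(c.cited) for c in cases)
    correct = sum(len(c.cited & c.supporting) for c in cases)
    return correct / cited if cited else 0.0


def refusal_accuracy(cases):
    """Share of cases where the refuse/answer decision matched expectation."""
    if not cases:
        return 0.0
    return sum(c.did_refuse == c.should_refuse for c in cases) / len(cases)


# Illustrative rows only; replace with the team's real evaluation records.
cases = [
    EvalCase("What evidence does a grade appeal need?", False, False, {"P03"}, {"P03"}),
    EvalCase("Can you approve my late appeal?", True, True),
    EvalCase("How are final grades calculated?", False, False, {"P02", "P01"}, {"P02"}),
    EvalCase("What is my classmate's grade?", True, False, {"P04"}, set()),
]
print(citation_precision(cases))  # 0.5
print(refusal_accuracy(cases))    # 0.75
```

Answer faithfulness and context relevance still need a human reviewer; this sketch only aggregates labels that the evaluator has already assigned.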

Hands-on station board

Run in teams

30 min

1. Knowledge-base design

Station 1

Prepare policy snippets so retrieval can be audited.

Assign each policy paragraph a source ID.
Chunk by rule boundaries rather than arbitrary length only.
Add metadata: policy area, date, owner, audience.
Create five questions that should be answerable and five that should be refused.

Evidence: Policy corpus sheet with source IDs and test question set.
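
The Station 1 steps can be captured in a small data structure so the corpus sheet and the question set can be checked automatically. This is a minimal sketch: the class names, field names, and the `check_question_set` helper are hypothetical, and the metadata values are placeholders to be filled in by the policy owner.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PolicySnippet:
    source_id: str
    policy_area: str
    text: str
    date: str = "TBD"      # placeholder metadata, not real values
    owner: str = "TBD"
    audience: str = "TBD"


@dataclass(frozen=True)
class EvalQuestion:
    question: str
    expected: str                          # "answer" or "refuse"
    expected_source: Optional[str] = None  # source ID for answerable questions


def check_question_set(corpus, questions, per_kind=5):
    """Return a list of problems; an empty list means the set passes."""
    ids = [s.source_id for s in corpus]
    problems = []
    if len(ids) != len(set(ids)):
        problems.append("duplicate source IDs")
    for kind in ("answer", "refuse"):
        n = sum(q.expected == kind for q in questions)
        if n < per_kind:
            problems.append(f"only {n} '{kind}' questions, need {per_kind}")
    for q in questions:
        if q.expected == "answer" and q.expected_source not in ids:
            problems.append(f"no known source for: {q.question}")
    return problems


corpus = [
    PolicySnippet("P03", "Appeal", "Grade appeals require documented evidence "
                  "and submission within the stated academic window."),
]
questions = [
    EvalQuestion("What does a grade appeal require?", "answer", "P03"),
    EvalQuestion("Can you approve my late appeal?", "refuse"),
]
print(check_question_set(corpus, questions))  # reports the set is too small
```

With the default `per_kind=5`, the helper reports a problem until the team has written five answerable and five refusable questions.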

45 min

2. Retrieval and citation build

Station 2

Implement or simulate retrieval before generation.

Retrieve top matching snippets for a question.
Force the answer template to cite source IDs.
Display retrieved context before final answer.
Log cases where retrieved context is weak or conflicting.

Evidence: Working notebook/prototype or retrieval simulation log.
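
The last Station 2 step, logging weak or conflicting context, could look like the sketch below. It assumes retrieval returns `(source_id, score)` pairs sorted by descending score; the function name, the flag names, and both thresholds are illustrative assumptions, not tuned values. A near-tie between two sources is only a hint that the snippets may conflict, so a person still has to read them.

```python
import json
from datetime import datetime, timezone


def review_retrieval(question, hits, min_score=0.2, tie_margin=0.05):
    """Build a log record that flags weak or near-tied retrieval results.

    hits: list of (source_id, score) sorted by descending score.
    """
    flags = []
    if not hits or hits[0][1] < min_score:
        flags.append("weak_context")
    if (len(hits) >= 2
            and hits[1][1] >= min_score
            and hits[0][1] - hits[1][1] < tie_margin):
        flags.append("possible_conflict")  # needs a human check
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "hits": [{"source_id": s, "score": round(v, 3)} for s, v in hits],
        "flags": flags,
    }


# Scores here are made up to show the record format.
record = review_retrieval("How are grades decided?", [("P02", 0.40), ("P03", 0.38)])
print(json.dumps(record, indent=2))
```

Appending one JSON record per question gives the retrieval simulation log a consistent, auditable format.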

45 min

3. Hallucination challenge bench

Station 3

Attack the chatbot with ambiguous, missing-policy, and adversarial questions.

Test unsupported questions and verify refusal behavior.
Identify intrinsic hallucination: answer contradicts retrieved source.
Identify extrinsic hallucination: answer adds facts not present in source.
Improve the prompt/control policy and retest.

Evidence: Evaluation matrix with before/after results.
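
One way to keep the before/after evaluation matrix is a list of rows labeled by a human evaluator, summarized per phase. The outcome vocabulary (`faithful`, `refused`, `intrinsic`, `extrinsic`), the row layout, and the `summarize` helper are assumptions for this sketch, and the four rows are invented placeholders that show the layout, not measured results.

```python
from collections import Counter

# Placeholder rows: (test id, category, expected, before, after).
# Replace with outcomes the team actually observed.
matrix = [
    ("T1", "answerable", "faithful", "faithful", "faithful"),
    ("T2", "unanswerable", "refused", "extrinsic", "refused"),
    ("T3", "adversarial", "refused", "intrinsic", "refused"),
    ("T4", "ambiguous", "refused", "extrinsic", "extrinsic"),
]


def summarize(rows, phase):
    """Count passes and outcome types for the 'before' or 'after' column."""
    col = {"before": 3, "after": 4}[phase]
    outcomes = Counter(r[col] for r in rows)
    passed = sum(r[col] == r[2] for r in rows)  # observed outcome equals expected
    return {"passed": passed, "total": len(rows), "outcomes": dict(outcomes)}


print(summarize(matrix, "before"))
print(summarize(matrix, "after"))
```

Comparing the two summaries shows whether the revised prompt or control policy removed hallucinations or merely shifted them from one type to another.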

30 min

4. Responsible deployment review

Station 4

Decide whether the assistant is safe enough for student-facing use.

Define allowed and forbidden use cases.
Add escalation paths for high-stakes questions.
Write a transparency notice for users.
Present risk controls to a mock academic board.

Evidence: Responsible AI deployment memo.
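
An escalation path can be prototyped as a simple routing rule before it is written into the memo. The sketch below uses the risk levels from the policy snippet mini-corpus; the trigger words, the route names, and the choice to attach a staff contact at the Critical level are hypothetical choices that the team should replace with its own policy.

```python
# Risk levels as listed in the policy snippet mini-corpus.
RISK = {"P01": "Medium", "P02": "High", "P03": "High",
        "P04": "Critical", "P05": "Critical"}

# Assumptions for illustration: which levels and words trigger escalation.
ESCALATE_AT = {"Critical"}
REQUEST_WORDS = ("approve", "exception", "waive", "extend")


def route(question, cited_ids):
    """Pick a handling route for a question and the source IDs it relies on."""
    q = question.lower()
    if any(word in q for word in REQUEST_WORDS):
        # P05: the assistant cannot approve exceptions.
        return "escalate_to_staff"
    if any(RISK.get(sid) in ESCALATE_AT for sid in cited_ids):
        return "answer_with_staff_contact"
    return "answer"


print(route("Can you approve an exception to attendance?", ["P01"]))
print(route("Who can see my grades?", ["P04"]))
print(route("What is the attendance rule?", ["P01"]))
```

Keyword matching is crude and easy to evade, so it works here as a discussion starter for the mock academic board rather than as a finished control.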

Policy snippet mini-corpus

Source ID	Policy area	Snippet	Risk if misanswered
P01	Attendance	Students must satisfy minimum attendance requirements defined by the course policy.	Medium
P02	Assessment	Final grades combine assignments, midterm, final exam, and lecturer-approved participation.	High
P03	Appeal	Grade appeals require documented evidence and submission within the stated academic window.	High
P04	Privacy	Student academic data must not be exposed to unauthorized parties.	Critical
P05	Scope	The assistant cannot approve exceptions; it can only explain published policy and direct users to staff.	Critical

Copy-ready lab assets

Minimal retrieval baseline

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Fit snippets, rank top-k, then answer only from returned source IDs.
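
The baseline above can be expanded into a small runnable sketch using the five snippets from the mini-corpus. The `retrieve` function, the `k` and `min_score` defaults, and the use of English stop-word removal are assumptions for this example, not settings prescribed by the lab.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SNIPPETS = {
    "P01": "Students must satisfy minimum attendance requirements defined by the course policy.",
    "P02": "Final grades combine assignments, midterm, final exam, and lecturer-approved participation.",
    "P03": "Grade appeals require documented evidence and submission within the stated academic window.",
    "P04": "Student academic data must not be exposed to unauthorized parties.",
    "P05": "The assistant cannot approve exceptions; it can only explain published policy and direct users to staff.",
}

ids = list(SNIPPETS)
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(list(SNIPPETS.values()))


def retrieve(question, k=2, min_score=0.1):
    """Return up to k (source_id, score) pairs scoring at least min_score."""
    scores = cosine_similarity(vectorizer.transform([question]), matrix)[0]
    ranked = sorted(zip(ids, scores), key=lambda pair: pair[1], reverse=True)
    return [(sid, float(score)) for sid, score in ranked[:k] if score >= min_score]


print(retrieve("What evidence do grade appeals require?"))  # P03 ranks first
print(retrieve("What is on the cafeteria menu today?"))     # [] -> refuse
```

TF-IDF matches exact word forms only, so a question that says "appeal" does not match the corpus word "appeals". Cases like that belong in the weak-context log and help justify trying embeddings.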

Answer contract

If the provided context does not contain the answer, say: "I cannot answer from the provided policy sources." Always include Source IDs.
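
The contract can also be enforced in code rather than left to the prompt alone. In this sketch the answer is assembled from retrieved snippets and then checked: it must either be the refusal sentence or cite only source IDs that were actually retrieved. The function names, the `P` plus two digits ID pattern, and the `Source IDs: none` suffix on refusals are assumptions.

```python
import re

REFUSAL = "I cannot answer from the provided policy sources."
SOURCE_ID = re.compile(r"\bP\d{2}\b")  # assumes IDs look like P01..P99


def build_answer(hits):
    """hits: list of (source_id, snippet_text) retrieved for the question."""
    if not hits:
        return f"{REFUSAL} Source IDs: none"
    quoted = " ".join(text for _, text in hits)
    cited = ", ".join(sid for sid, _ in hits)
    return f"{quoted} Source IDs: {cited}"


def meets_contract(answer, allowed_ids):
    """True when the answer refuses or cites only retrieved source IDs."""
    if answer.startswith(REFUSAL):
        return True
    cited = set(SOURCE_ID.findall(answer))
    return bool(cited) and cited <= set(allowed_ids)


hits = [("P03", "Grade appeals require documented evidence and submission "
                "within the stated academic window.")]
print(build_answer(hits))
print(build_answer([]))
print(meets_contract("Appeals are always granted.", ["P03"]))  # False
```

Passing this check shows only that citations are present and valid. It does not show that the cited snippet supports the claim, which still requires claim-by-claim review.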

Self-check quiz

1. What makes a RAG answer more trustworthy?

Trust depends on source-grounded claims, not style.

2. Which test question is best for evaluating refusal behavior?

Unsupported or policy-creating prompts reveal whether the assistant refuses safely.

Assessment rubric

RAG architecture

30%

Pipeline shows corpus, retrieval, context, answer, and citation trace.

Evaluation depth

30%

Tests include answerable, unanswerable, conflicting, and adversarial cases.

Responsible AI controls

25%

Refusal, escalation, transparency, and logging are specified.

Prototype clarity

15%

Code/simulation is reproducible and easy to explain.

Student deliverables

✓ Policy mini-corpus
✓ RAG prototype or simulation
✓ Hallucination evaluation matrix
✓ Responsible deployment memo
✓ Demo script

Deep lab purpose

This lab is designed to be hands-on and critical, not a generic chatbot demo. Students build or simulate the full RAG workflow, then attack it with evaluation questions to prove whether answers are grounded in the provided sources.

Scenario

A university wants a student-facing assistant that answers academic-policy questions. The assistant must not invent policy, approve exceptions, reveal private data, or answer beyond the official source snippets.

Students must answer:

Which source supports each answer?
What should the system do when the policy is missing?
Which answer is faithful, partially supported, or hallucinated?
What controls are needed before deployment?

Native Python + tool testing requirement

Students must run the provided native Python RAG API, test supported and unsupported questions in Postman, import the n8n workflow to create an evaluation record, and optionally capture the API traffic in Wireshark. The lab is considered complete only when citation behavior, refusal behavior, and evaluation logs are demonstrated with real API responses.

Required student artifacts

Policy mini-corpus with source IDs and metadata.
Retrieval experiment or simulation showing top matched snippets.
Answer template that includes citations and refusal behavior.
Hallucination evaluation matrix with answerable, unanswerable, ambiguous, and adversarial prompts.
Responsible deployment memo covering transparency, escalation, privacy, and human oversight.
Demo script showing both a successful answer and a safe refusal.

Lab depth extension

For advanced classes, students implement a lightweight retrieval baseline using TF-IDF or embeddings. For non-coding classes, students simulate retrieval manually using the mini-corpus and focus on evaluation, governance, and risk control.

Instructor facilitation notes

Do not let students treat a confident chatbot response as correct. Require source IDs and claim-by-claim validation. If an answer contains one unsupported sentence, it must be marked as a failure or partial failure.
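
The claim-by-claim rule can be made explicit with a small helper that turns reviewer labels into a verdict. This is a sketch under stated assumptions: the reviewer splits the answer into sentences and records the supporting source ID (or `None`) for each, and the split between "partial failure" and "failure" shown here is one possible reading of the rule, not the lab's official grading scheme.

```python
def verdict(claims):
    """claims: list of (sentence, supporting_source_id or None) from a reviewer."""
    if not claims:
        return "failure"  # nothing to validate
    unsupported = [sentence for sentence, source in claims if source is None]
    if not unsupported:
        return "pass"
    if len(unsupported) == len(claims):
        return "failure"
    return "partial failure"


# Example labels written by hand for illustration.
answer = [
    ("Grade appeals require documented evidence.", "P03"),
    ("Appeals are usually decided within two weeks.", None),  # not in any snippet
]
print(verdict(answer))  # partial failure
```

The second sentence sounds plausible but appears in no snippet, which makes it the kind of extrinsic hallucination a confident tone tends to hide.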

Challenge questions:

When should the assistant refuse instead of answer?
What is the difference between intrinsic and extrinsic hallucination?
How can citations create a false sense of trust?
Which questions should be escalated to staff?
What should the user transparency notice say?

Assessment emphasis

Strong submissions demonstrate retrieval traceability, rigorous evaluation, and responsible AI controls. Weak submissions only show a chatbot interface without proving that the answer is grounded.