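adaptive-testing
Design and implement adaptive testing systems using Item Response Theory (IRT). Use when working on computerized adaptive testing (CAT), psychometric assessment, ability estimation, question calibration, test design, or IRT models (1PL/2PL/3PL). Covers test algorithms, stopping rules, item selection strategies, and practical implementation patterns for K-12, certification, placement, and diagnostic assessments.
Installation / Download
TotalClaw CLI (recommended)
`totalclaw install totalclaw:totalclaw~woodstocksoftware-adaptivetest`
cURL direct download (no login required)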
`curl -fsSL https://skills.taituai.com/api/skills/totalclaw%3Atotalclaw~woodstocksoftware-adaptivetest/file -o woodstocksoftware-adaptivetest.md`

# Adaptive Testing with IRT

Design computerized adaptive tests that measure ability efficiently and accurately using Item Response Theory.

## Core Concept

Adaptive tests adjust difficulty in real time based on student responses. A correct answer → harder question. An incorrect answer → easier question. The result: ability estimates as accurate as those from a fixed-length test, typically with about 50% fewer questions.

**Key advantage:** Traditional tests waste time on too-easy or too-hard questions. Adaptive tests spend time where measurement matters most: near the student's ability level.

## Quick Decision Tree

| You need to... | See |
|----------------|-----|
| Understand IRT models and parameters | [IRT Fundamentals](#irt-fundamentals) |
| Design a new adaptive test | [Test Design Workflow](#test-design-workflow) |
| Choose item selection algorithm | [Item Selection](#item-selection-strategies) |
| Decide when to stop the test | [Stopping Rules](#stopping-rules) |
| Calibrate new questions | `references/calibration.md` |
| Implement CAT algorithm | `references/implementation.md` |

---

## IRT Fundamentals

### The 3-Parameter Logistic (3PL) Model

Most adaptive tests use the 3PL model. Each question has three parameters:

- **a** (discrimination): How well the question differentiates ability levels. Higher = steeper curve. Typical range: 0.5 to 2.5
- **b** (difficulty): The ability level where P(correct) = (1 + c)/2, which is 0.5 when c = 0. Range: -3 to +3 (standardized scale)
- **c** (guessing): Probability of guessing correctly. Usually 0.2 to 0.25 for multiple choice

**Probability of correct response:**

```
P(correct | ability, a, b, c) = c + (1 - c) / (1 + e^(-a(ability - b)))
```

**Simpler models:**

- **2PL:** Set c = 0 (no guessing parameter)
- **1PL (Rasch):** Set c = 0 and a = 1 for all items (only difficulty varies)

Use 3PL for high-stakes tests. Use 2PL/1PL when sample size is small (<500 responses per item).

### Information and Standard Error

**Information** measures how precisely an item estimates ability at a given level. Peak information occurs when ability ≈ difficulty (b parameter).

**Standard Error (SE)** is the inverse square root of the test information (the sum of item information over the administered items):

```
SE = 1 / sqrt(Information)
```

**Goal of CAT:** Maximize information (minimize SE) at the student's true ability level.

---

## Test Design Workflow

### 1. Define Test Specifications

- **Purpose:** Placement, diagnostic, certification, progress monitoring?
- **Content domain:** Single skill or multidimensional?
- **Target population:** What ability range (-3 to +3)?
- **Constraints:** Time limit, minimum/maximum length, content balance

### 2. Build Item Bank

**Minimum bank size:** 10× the average test length. For a 20-item CAT, you need ≥200 calibrated items.

**Distribution targets:**

- Difficulty (b): Spread across expected ability range
- Discrimination (a): Target 1.0 to 2.0 (high discrimination)
- Exposure: No item used >20% of the time

**Content balancing:** If testing math, ensure geometry/algebra/etc. are proportionally represented.

### 3. Choose Algorithms

Pick one from each category:

**Item selection:** (see below)

- Maximum Information
- Randomesque (MFI + exposure control)
- Content balancing

**Ability estimation:**

- Maximum Likelihood Estimation (MLE)
- Expected A Posteriori (EAP): better for extreme scores
- Weighted Likelihood (WLE)

**Stopping rule:** (see below)

- Fixed length
- Standard error threshold
- Information threshold

### 4. Simulate Performance

Before going live, simulate 1000+ test sessions with known abilities.
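A simulation of this kind can be sketched in a few dozen lines. The following is a minimal illustration, not the skill's reference implementation. All function names are hypothetical, the item bank is synthetic (random a and b, fixed c = 0.2), ability is estimated by EAP on a fixed grid with a standard normal prior, and the EAP posterior standard deviation stands in for SE. It assumes the bank is larger than `max_items`.

```python
import math
import random

def p_correct(theta, a, b, c):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of one 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    return a * a * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p

# Quadrature grid for EAP (assumed range: -4 to +4 in steps of 0.1).
GRID = [-4.0 + 0.1 * k for k in range(81)]

def eap_estimate(responses):
    """EAP estimate and posterior SD; responses is a list of ((a, b, c), correct)."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        for (a, b, c), correct in responses:
            p = p_correct(theta, a, b, c)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def simulate_session(true_theta, bank, rng, se_target=0.3, max_items=30, top_n=5):
    """Run one simulated CAT; return (estimate, se, administered item indices)."""
    used, responses = [], []
    theta_hat, se = 0.0, 1.0  # start at the population mean
    while len(used) < max_items and se >= se_target:
        candidates = [i for i in range(len(bank)) if i not in used]
        candidates.sort(key=lambda i: item_information(theta_hat, *bank[i]),
                        reverse=True)
        pick = rng.choice(candidates[:top_n])  # randomesque: random among top N
        correct = rng.random() < p_correct(true_theta, *bank[pick])
        used.append(pick)
        responses.append((bank[pick], correct))
        theta_hat, se = eap_estimate(responses)
    return theta_hat, se, used

def run_simulation(n_sessions=200, bank_size=300, seed=0):
    """Simulate many sessions and summarize length, accuracy, and exposure."""
    rng = random.Random(seed)
    bank = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0), 0.2)
            for _ in range(bank_size)]
    lengths, errors, exposure = [], [], [0] * bank_size
    for _ in range(n_sessions):
        true_theta = max(-3.0, min(3.0, rng.gauss(0.0, 1.0)))
        est, _, used = simulate_session(true_theta, bank, rng)
        lengths.append(len(used))
        errors.append(est - true_theta)
        for i in used:
            exposure[i] += 1
    return {
        "mean_length": sum(lengths) / n_sessions,
        "rmse": math.sqrt(sum(e * e for e in errors) / n_sessions),
        "max_exposure": max(exposure) / n_sessions,
    }
```

Calling `run_simulation()` returns a dictionary with the average test length, the root-mean-square error of the ability estimates, and the highest item exposure rate. A real study would replace the synthetic bank with calibrated item parameters and add content-balance tracking.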
Check the following in the simulation results:

- Average test length
- SE at different ability levels
- Item exposure rates
- Content balance adherence

Adjust if needed.

---

## Item Selection Strategies

### Maximum Fisher Information (MFI)

**Rule:** Select the item with highest information at current ability estimate.

**Pros:** Optimal precision, shortest tests
**Cons:** Overuses "best" items, poor security

**Use when:** Pilot testing, low-stakes practice

### Randomesque (MFI + Exposure Control)

**Rule:** Select from top N items by information (e.g., top 5), choose randomly from that set.

**Pros:** Balances precision and security
**Cons:** Slightly longer tests than pure MFI

**Use when:** Operational tests, default choice

### a-Stratified

**Rule:** Start with low-discrimination items (low a) and save high-discrimination items (high a) for later, when the ability estimate is more stable.

**Pros:** Evens out item exposure; the most informative items are not spent while the estimate is still noisy
**Cons:** Complex to implement

**Use when:** Very large item banks, research settings

### Content Balancing

**Rule:** Track content area usage, prioritize underrepresented areas when selecting next item.

**Implementation:** Weight information by content constraint satisfaction.

**Use when:** Blueprint requirements, multidimensional tests

---

## Stopping Rules

### Fixed Length

Stop after N items (e.g., 20 questions).

**Pros:** Predictable time, simple
**Cons:** May over/under-test some students

**Use when:** Time limits matter, simple implementation needed

### Standard Error Threshold

Stop when SE < target (e.g., SE < 0.3).

**Pros:** Consistent precision across ability levels
**Cons:** Variable test length (harder to schedule)

**Typical targets:**

- Low-stakes: SE < 0.4
- Medium-stakes: SE < 0.3
- High-stakes: SE < 0.25

**Use when:** Precision matters more than time

### Combined Rule

Stop when (SE < target) OR (length ≥ max) OR (length ≥ min AND ability estimate stable).

**Use when:** Production systems (safest approach)

---

## Practical Considerations

### Starting Ability Estimate

**Options:**

1. Population mean (θ = 0)
2. Prior information (e.g., grade level, previous test)
3. First question is medium difficulty, estimate from there

Never start at extremes (-3 or +3).

### Handling Extreme Response Patterns

**All correct or all incorrect:** MLE fails because the likelihood has no finite maximum. Use EAP or a Bayesian prior to regularize.

**Rapid changes:** If the ability estimate jumps by more than 1.0 after the first several items, consider a response anomaly (cheating, guessing).

### Exposure Control

Track how often each item is used. Flag items used >20% of the time. Consider:

- Randomesque selection (above)
- Sympson-Hetter method (advanced)
- Periodic item bank refresh

### Multidimensional IRT (MIRT)

If testing multiple skills (e.g., algebra + geometry), use separate ability estimates per dimension. Select items to balance information across dimensions.

**Warning:** MIRT requires larger item banks and more complex calibration.

---

## Common Mistakes

❌ **Too few items in bank** → High exposure, security risk
✅ Aim for 10× average test length

❌ **Poorly distributed difficulties** → Accurate only in narrow ability range
✅ Spread items across -2 to +2 difficulty

❌ **Ignoring content balance** → May skip important topics
✅ Build content constraints into item selection

❌ **Using MLE for all incorrect** → Returns -∞
✅ Use EAP or cap estimates at -3/+3

❌ **No exposure control** → Same items every test
✅ Use randomesque or Sympson-Hetter

---

## When to Load References

| Need | File |
|------|------|
| Calibrate new items (collect data, estimate parameters) | `references/calibration.md` |
| Implement CAT algorithm (code patterns, libraries) | `references/implementation.md` |

---

## Real-World Example: K-12 Math Placement

**Setup:**

- Item bank: 300 questions, b from -2 (basic) to +2 (advanced)
- Target: SE < 0.35 or max 25 questions
- Content: 40% algebra, 30% geometry, 30% statistics
- Algorithm: Randomesque (top 5), EAP estimation

**Flow:**

1. Start at θ = 0 (grade-level average)
2. Select item: b ≈ 0, content area needed
3. Student answers → update ability estimate (EAP)
4. Select next: maximize information at new θ, respect content balance, randomesque from top 5
5. Stop when SE < 0.35 or 25 questions reached
6. Report: ability estimate + placement recommendation

**Result:** Average 18 questions, 95% of students placed within ±0.5 grade levels of t