White Paper · Proficiency Scoring Methodology

How Arrival Measures What Students Know

A detailed methodology paper on Arrival's multi-source, trajectory-aware proficiency scoring engine — how it works, what corrections have been made, our confidence in those corrections, and what will change as more data becomes available.

Evidence Sources: Quizzes, Assessments, Work Uploads, Coaching
Recency Model: Exponential Decay (45-Day Half-Life)
Anomaly Detection: Adaptive 2σ Threshold
Confidence Intervals: Standard Error with t-Correction

1 · Executive Summary

Arrival's proficiency engine converts raw evidence — quiz scores, assessment results, work uploads, and coaching interactions — into a single, defensible proficiency estimate for every student on every skill. That estimate is not a simple average. It is the output of a multi-layered pipeline that accounts for the reliability of each evidence source, the recency of each observation, the trajectory of a student's learning over time, and the statistical confidence we can place in the result.

Every score passes through a governance layer that enforces fairness thresholds, explainability requirements, and uncertainty constraints before it is surfaced to faculty or used to inform interventions. The system cannot act on uncertain predictions. When any constraint fails, it defers to human review.

The goal is not to assign a number. It is to produce a proficiency estimate that a faculty member can trust enough to act on — and that a student can trust enough to learn from.

This paper describes each stage of the pipeline, the recent corrections made to improve accuracy, the statistical confidence behind those corrections, and the directions we expect the model to evolve as more institutional data becomes available.

2 · The Problem with Simple Averages

Traditional grading systems compute a mean or weighted mean of assignment scores. This approach has three fundamental accuracy problems that compound as a course progresses:

Source Blindness

A quiz answer and a coaching session exchange are treated with equal evidentiary weight. In reality, a controlled quiz provides far stronger evidence of mastery than an engagement signal from a coaching conversation. Simple averages cannot distinguish between these.

Temporal Flatness

A score from the first week of the semester carries the same weight as one from yesterday. For a student who has been improving steadily, this drags the estimate down. For a student who has been declining, it inflates the estimate. The average lags the truth in both directions.

False Precision

A student scored on two quiz questions receives a number that looks identical in format to a student scored on twenty. The average provides no mechanism to distinguish between “we're fairly certain” and “we're guessing.” Without confidence intervals, faculty cannot calibrate the urgency of an intervention.

Arrival's proficiency engine was built to solve all three problems simultaneously. Every score carries source weights, recency decay, and a confidence interval that tells faculty not just where a student stands — but how sure we are about it.

3 · Multi-Source Evidence Architecture

The engine accepts evidence from four distinct source types, each assigned a reliability weight reflecting how strongly a single observation from that source indicates true mastery:

Quiz · weight 1.00
Controlled assessment environment. Questions are calibrated to specific skills. Highest evidentiary reliability.

Assessment · weight 1.00
Initial onboarding assessment. Broader in scope, establishes a baseline across all course skills.

Work Upload · weight 0.75
Student-submitted work analyzed by AI. Rich evidence of applied mastery, but noisier due to open-ended format.

Coaching · weight 0.35
Signals derived from coaching interactions. Engagement is a weak proxy for mastery. Lowest reliability.

Source-Normalized Weighting

A critical design decision is source normalization. Rather than weighting each individual observation independently (which allows high-volume, low-reliability sources to overwhelm the estimate), the engine first computes a recency-weighted average within each source, then combines those per-source averages by reliability weight.

Phase 1: Within-Source Averaging

For each source type, all observations are combined into a single recency-weighted average. If a student has ten coaching signals and one quiz score, the coaching signals produce one aggregate number — not ten individual weights competing against a single quiz.

Phase 2: Cross-Source Combination

The per-source averages are then combined using reliability weights. The quiz average (weight 1.0) and the coaching average (weight 0.35) are brought together proportionally. This ensures the final score reflects the quality of evidence, not just the quantity.

One strong quiz is worth more than ten coaching signals — and the algorithm ensures the math reflects that.
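For concreteness, the two-phase computation can be sketched in a few lines of Python. The source weights and the 45-day half-life are those described in this paper; the function shape and data format are illustrative, not the production implementation.

```python
# Sketch of two-phase, source-normalized weighting. Source reliability
# weights and the 45-day half-life are from the paper; everything else
# (names, data shapes) is hypothetical.
from collections import defaultdict

SOURCE_WEIGHTS = {"quiz": 1.00, "assessment": 1.00,
                  "work_upload": 0.75, "coaching": 0.35}
HALF_LIFE_DAYS = 45.0

def recency_weight(age_days: float) -> float:
    """Exponential decay: half weight every 45 days."""
    return 2 ** (-age_days / HALF_LIFE_DAYS)

def proficiency(observations):
    """observations: list of (source, score_0_100, age_days) tuples."""
    # Phase 1: recency-weighted average within each source.
    per_source = defaultdict(lambda: [0.0, 0.0])  # source -> [w*score sum, w sum]
    for source, score, age in observations:
        w = recency_weight(age)
        per_source[source][0] += w * score
        per_source[source][1] += w
    # Phase 2: combine per-source averages by reliability weight.
    num = den = 0.0
    for source, (weighted_sum, weight_sum) in per_source.items():
        num += SOURCE_WEIGHTS[source] * (weighted_sum / weight_sum)
        den += SOURCE_WEIGHTS[source]
    return num / den

# Ten high coaching signals cannot outvote one weak quiz score:
obs = [("coaching", 95, 1)] * 10 + [("quiz", 60, 1)]
print(round(proficiency(obs), 1))
```

Note that the ten coaching observations collapse to a single aggregate before the reliability weights are applied, which is exactly what prevents count imbalance.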

4 · Exponential Recency Model

Learning is temporal. A student's proficiency today is better reflected by recent evidence than by observations from the start of the semester. The engine applies exponential recency decay with a 45-day half-life to every observation.

How It Works

Each observation's weight is multiplied by 2^(−t/45), where t is the age of the observation in days. An observation from today has full weight. An observation from 45 days ago has half weight. An observation from 90 days ago has one-quarter weight.

This is a continuous, smooth decay — not a hard cutoff. Old observations still contribute; they simply cannot dominate the estimate when fresher evidence exists.

Why Exponential, Not Linear

The previous model used a linear decay function that flattened after 70 days, treating a 70-day-old score identically to a 700-day-old one. This created a “stale floor” where old evidence accumulated permanent influence regardless of age.

Exponential decay mirrors how learning actually works: recent performance is the strongest indicator of current capability, and the influence of older evidence fades naturally without artificial thresholds.

Recency Weight by Observation Age

Today: 100% · 2 Weeks: 80% · 45 Days: 50% · 90 Days: 25% · 6 Months: 6%
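The weights tabulated above follow directly from the decay formula; a minimal check in Python:

```python
# Reproduce the recency-weight table from w(t) = 2^(-t / 45).
# The tabulated values are rounded to whole percentages.
for days, label in [(0, "Today"), (14, "2 Weeks"), (45, "45 Days"),
                    (90, "90 Days"), (180, "6 Months")]:
    weight = 2 ** (-days / 45)
    print(f"{label:>9}: {weight:.1%}")
```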

5 · Trajectory Analysis

A point estimate tells you where a student is. Trajectory analysis tells you where they are going. The engine analyzes the sequence of scores over time to compute velocity, consistency, and a genuineness score that distinguishes real learning from noise.

Exponentially-Weighted Velocity

Rather than computing a simple average of all score changes, the engine weights recent changes more heavily (decay factor: 0.7 per step). A student who was declining but has recently turned around will show a positive velocity — the simple average would still show stagnation.
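A sketch of the weighting, assuming (as the text states) a 0.7 discount per step back in time; the function name and history format are illustrative.

```python
# Exponentially-weighted velocity over a student's chronological scores.
# The 0.7-per-step discount is from the paper; the rest is a sketch.
def velocity(scores):
    """scores: chronological proficiency scores for one skill."""
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    if not deltas:
        return 0.0
    num = den = 0.0
    for steps_back, delta in enumerate(reversed(deltas)):
        w = 0.7 ** steps_back       # most recent change carries full weight
        num += w * delta
        den += w
    return num / den

# A decline followed by a recent turnaround: the weighted velocity is
# clearly positive, while the simple mean of the same deltas is negative.
history = [70, 60, 50, 58, 68]
print(velocity(history))
print(sum(b - a for a, b in zip(history, history[1:])) / 4)
```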

Adaptive Anomaly Detection

Instead of flagging any jump greater than a fixed 30-point threshold, the engine computes the standard deviation of the student's own score deltas and flags jumps beyond 2σ (with a minimum floor of 20 points). This adapts to each student's natural variability.
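The adaptive threshold can be sketched with the standard library. Whether the engine uses population or sample standard deviation, and how it handles very short histories, are assumptions here.

```python
# Adaptive anomaly threshold: 2 sigma of the student's own score deltas,
# with a 20-point minimum floor (both values from the paper).
import statistics

def anomaly_threshold(scores, floor=20.0):
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    if len(deltas) < 2:
        return floor                 # too little history: fall back to the floor
    return max(floor, 2 * statistics.pstdev(deltas))

def is_anomalous_jump(scores, new_score):
    return abs(new_score - scores[-1]) > anomaly_threshold(scores)

# A volatile student tolerates bigger swings than a steady one:
steady = [70, 71, 69, 72, 70]
volatile = [40, 85, 55, 90, 50]
print(anomaly_threshold(steady), anomaly_threshold(volatile))
```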

Continuous Genuineness Score

Learning genuineness is no longer a binary yes/no. It is a continuous 0–1 score combining directional consistency, velocity magnitude, and an anomaly penalty. A genuineness of 0.8 means strong, sustained learning. A score of 0.2 means noisy or suspicious improvement.

Trend Classification

Accelerating (> +5 pts/assessment): Strong upward trajectory with sustained, genuine improvement.

Improving (+1 to +5 pts): Steady gains. Consistency determines whether progress appears sustained or variable.

Stable (−1 to +1 pts): Scores are flat. No significant growth detected. May indicate a plateau or need for intervention.

Slipping (−5 to −1 pts): Moderate decline. Early intervention recommended before the trend deepens.

Declining (< −5 pts): Significant, sustained loss of proficiency. Immediate support needed.

Score Adjustment

When the trajectory is both genuine (genuineness ≥ 0.4) and backed by at least three data points, the engine applies a small momentum adjustment to the weighted score. The adjustment is proportional to both the velocity and the genuineness confidence:

Momentum = genuineness × velocity × 0.3

A student with genuineness 0.8 and velocity +6 receives a momentum boost of +1.4 points. A student with genuineness 0.3 (below the threshold) receives no adjustment regardless of velocity. This prevents lucky guesses or noisy spikes from inflating scores.

Conversely, suspicious positive anomalies trigger proportional dampening. A 30-point unexplained jump dampens the score by approximately 3%. A 60-point jump dampens by approximately 6%, capped at 10%. Only suspicious positive jumps are dampened — legitimate score drops are not penalized further.
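Both rules can be summarized in a short sketch. The momentum formula, the 0.4 genuineness gate, and the three-point minimum come from the text; the 0.1%-per-point dampening slope is inferred from the 30-point / 3% and 60-point / 6% examples and should be read as an assumption.

```python
# Momentum boost and anomaly dampening, per the rules above.
def momentum_adjustment(genuineness, velocity, n_points):
    # Gate: at least 3 data points and genuineness >= 0.4 (from the paper).
    if n_points < 3 or genuineness < 0.4:
        return 0.0
    return genuineness * velocity * 0.3

def dampening_factor(jump_size, suspicious_positive):
    if not suspicious_positive:
        return 0.0                   # legitimate drops are never penalized
    # Assumed slope: ~0.1% per point of unexplained upward jump, capped at 10%.
    return min(0.10, jump_size * 0.001)

print(momentum_adjustment(0.8, 6, 5))   # the worked example above: ~+1.4
print(dampening_factor(60, True))       # ~6% dampening, capped at 10%
```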

6 · Confidence Intervals

Instead of “you're at 72%,” Arrival says “you're at 72% (65–79).” Confidence intervals communicate the precision of the estimate and enable faculty to make appropriately calibrated decisions.

Standard Error with t-Correction

The interval width is derived from the standard error of the mean of all observations, multiplied by the appropriate t-value for a 90% confidence interval. For small samples (n ≤ 10), we use the Student's t-distribution rather than the normal approximation, which correctly produces wider intervals when data is sparse.

With one observation, the t-value is 6.314 — producing a wide interval that honestly communicates uncertainty. By ten observations, it narrows to 1.812, approaching the large-sample z-value of 1.645.
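A sketch of the interval computation. The t-values are the paper's stated figures for a 90% interval; how the engine indexes them at the smallest sample sizes, and what it does with a single observation (where no sample variance exists), are assumptions here.

```python
# 90% interval: standard error of the mean with small-sample t-correction.
import statistics

T90 = {1: 6.314, 2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015,
       6: 1.943, 7: 1.895, 8: 1.860, 9: 1.833, 10: 1.812}

def confidence_interval(scores, estimate):
    n = len(scores)
    if n < 2:
        return (0.0, 100.0)          # one observation: maximal uncertainty (assumed)
    se = statistics.stdev(scores) / n ** 0.5
    t = T90.get(n, 1.645)            # large-sample z beyond n = 10
    half = t * se
    return (max(0.0, estimate - half), min(100.0, estimate + half))

lo, hi = confidence_interval([68, 74, 71, 76, 70], 72)
print(f"72 ({lo:.0f}-{hi:.0f})")
```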

Source Quality Adjustment

The interval width is further adjusted by the proportion of high-reliability sources (quizzes and formal assessments) in the observation set. When 100% of evidence comes from controlled assessments, the interval narrows by up to 15%.

This reflects the practical reality that quiz scores have lower measurement error than coaching signals, independent of sample size.

Confidence Accumulation

Overall confidence (0–1) accumulates with diminishing returns. Each additional observation from a source adds less confidence than the previous one, following a geometric model:

Source Confidence = 1 − (1 − rate)^n, where rate is the per-observation confidence contribution and n is the number of observations.

The first quiz contributes 15% toward full confidence. The second adds 12.75% (15% of the remaining 85%). The fifth adds about 7.8%. This prevents the system from reaching false certainty on a narrow evidence base, while still rewarding breadth of evidence.
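The accumulation rule is compact enough to verify directly; the 0.15 quiz rate is from the text, and the marginal gains below follow from the formula.

```python
# Geometric confidence accumulation: each new observation claims a fixed
# fraction of the remaining distance to full confidence.
def source_confidence(rate, n):
    return 1 - (1 - rate) ** n

# Marginal contribution of each successive quiz (rate = 0.15):
for n in (1, 2, 5, 10):
    gain = source_confidence(0.15, n) - source_confidence(0.15, n - 1)
    print(f"quiz #{n} adds {gain:.1%}")
```

The marginal gains decline geometrically (roughly 15%, then 12.75%, 7.8%, 3.5%), so no single source can rush the system to certainty.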

Reliability Labels

High: Confidence ≥ 80%, interval width ≤ 10 pts
Moderate: Confidence ≥ 50%, interval width ≤ 18 pts
Low: Below moderate thresholds

7 · Governance and Fairness

Every proficiency score passes through a four-layer governance pipeline before it is persisted or surfaced to users. The system operates under the constraint that automated action is conditional, not assumed. When any constraint fails, the system defers to human review.

L0 · Input Validation

Verifies consent, provenance, and data integrity. Only necessary fields are propagated forward (feature minimization). A score without a verifiable student identity and skill mapping is rejected before computation begins.

L1 · Score Audit

Requires every score to include an evidence summary (explainability constraint). Warns when confidence falls below the threshold. A score that cannot explain itself is flagged as inadmissible.

L2 · Uncertainty Estimation

Computes the variance of the student's score distribution and checks it against the uncertainty threshold (δ = 0.25). Prevents the system from acting on unstable or overconfident predictions.

L3 · Decision Governance

Evaluates all constraints in hierarchy: Privacy ≻ Fairness ≻ Risk ≻ Confidence. Automated action requires all constraints to pass. Insufficient assessment history always defers to human judgment.

Fairness Monitoring

By design, Arrival does not collect race, gender, or demographic data. Fairness is monitored using proxy groupings — enrollment cohort, institution, and engagement level — to detect whether AI scoring produces inequitable score distributions. The fairness metric is the difference in pass rates (score ≥ 70) between groups, held to an ε = 0.10 threshold. A deviation exceeding 10% triggers an automatic flag for human review.
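The fairness metric itself is simple; a sketch with the paper's ε = 0.10 threshold and 70-point pass mark (the group data is, of course, hypothetical):

```python
# Fairness check: difference in pass rates (score >= 70) between proxy
# groups, flagged when it exceeds epsilon = 0.10 (both from the paper).
EPSILON = 0.10
PASS_MARK = 70

def pass_rate(scores):
    return sum(s >= PASS_MARK for s in scores) / len(scores)

def fairness_flag(group_a, group_b):
    gap = abs(pass_rate(group_a) - pass_rate(group_b))
    return gap, gap > EPSILON

gap, flagged = fairness_flag([72, 81, 65, 90, 74], [68, 71, 66, 69, 64])
print(f"pass-rate gap {gap:.0%}, flagged: {flagged}")
```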

L4: Continuous Monitoring evaluates drift, anomalies, and fairness deviations across all scored assessments. Score changes exceeding 40 points between consecutive assessments for the same skill are flagged as anomalies for review, regardless of whether the governance pipeline approved the individual score.

8 · AI Assessment Scoring

The raw skill scores that feed the proficiency engine are generated by Anthropic's Claude, operating under tightly constrained system prompts that enforce evidence-based scoring discipline.

Quiz & Assessment Scoring

Claude receives the questions, correct answers, the student's responses, and the skill definitions for the course. It scores each skill on a 0–100 rubric, provides a confidence rating based on how much evidence the questions provided, and writes an evidence summary citing specific answer quality.

0–30: Little to no demonstrated understanding
31–50: Partial understanding, significant gaps
51–70: Foundational competency, some gaps
71–85: Solid understanding, minor gaps
86–100: Strong mastery demonstrated

Work Upload Assessment

For student-submitted work (essays, projects, problem sets), Claude analyzes the submission against all course skills, returning a score only for skills where the work provides direct evidence. Skills without evidence receive null rather than a zero — ensuring the absence of evidence is not treated as evidence of absence.

The system prompt enforces rigor: “High scores require clear demonstration, not just mention.” Every assessment includes specific citations from the student's work to maintain full traceability.

Claude's raw scores are the starting point, not the final answer. They enter the multi-source weighting pipeline as one observation among many, subject to the same recency decay, source normalization, trajectory analysis, and governance constraints as every other evidence source.

9 · Recent Corrections and Improvements

The proficiency engine has undergone seven targeted corrections to improve scoring accuracy. Each correction addresses a specific, identified failure mode in the previous implementation.

01 · Linear Recency Decay Replaced with Exponential Decay (High Impact)

Problem

The previous linear decay function (1 - ageDays × 0.01) produced a flat minimum weight of 0.3 after 70 days. A score from 70 days ago had the same influence as a score from 700 days ago, creating permanent contamination of the estimate by stale evidence.

Correction

Replaced with exponential decay using a 45-day half-life: 2^(-t/45). Influence decays smoothly and continuously. A 6-month-old score carries only 6% weight, naturally fading without an arbitrary cutoff.

02 · Source Count Imbalance Eliminated via Normalization (High Impact)

Problem

Each observation was weighted independently. A student with 10 coaching signals and 1 quiz saw coaching dominate the final score despite coaching's low reliability weight (0.35), because raw accumulation overwhelmed the source weight multiplier.

Correction

Introduced two-phase computation: within-source recency-weighted average first, then cross-source combination by reliability weight. Each source now contributes one aggregate signal, regardless of how many individual observations it contains.

03 · Linear Confidence Replaced with Diminishing Returns (Medium Impact)

Problem

Confidence accumulated linearly: 0.15 per quiz observation, uncapped until hitting 1.0. The 10th quiz added the same 15% as the 1st, and 7 quizzes alone could reach full confidence — creating false certainty on a narrow evidence base.

Correction

Switched to geometric accumulation: 1 - (1 - rate)^n. The first quiz adds 15%. The fifth adds about 7.8%. The tenth adds about 3.5%. Full confidence now requires either extensive data from a single reliable source or corroborating evidence across multiple source types.

04 · Fixed Anomaly Threshold Replaced with Adaptive Detection (Medium Impact)

Problem

Any score change exceeding 30 points was flagged as anomalous. For a student with naturally high variability (e.g., strong quiz scores but weak uploads), normal fluctuations triggered false anomaly flags. For a student with very consistent scores, a suspicious 25-point jump went undetected.

Correction

Anomaly threshold is now computed per-student as 2σ from their mean delta, with a 20-point minimum floor. The system adapts to each student's natural variability rather than applying a universal cutoff.

05 · Heuristic Confidence Intervals Replaced with Statistical CIs (Medium Impact)

Problem

The previous interval used an arbitrary base half-width of 25 points that narrowed by a hand-tuned formula. The resulting intervals bore no statistical relationship to the actual variance of the observed data.

Correction

Intervals are now computed from the standard error of the mean, multiplied by the appropriate t-value for the sample size. For n=1, the wide t-value (6.314) produces an honest interval. For n=10+, the interval narrows toward the normal approximation. Source quality provides an additional continuous narrowing factor.

06 · Trajectory Adjustments Scaled by Genuineness Confidence (Medium Impact)

Problem

The previous model applied a flat velocity × 0.5 boost whenever isGenuine was true and velocity exceeded 3. Anomaly dampening was a blanket 5% applied to all anomalies regardless of direction or severity. This produced both over-boosting of noisy upward trends and unnecessary penalization of legitimate score drops.

Correction

The boost is now genuineness × velocity × 0.3, where genuineness is a continuous 0–1 score. Dampening is proportional to anomaly severity and applied only to suspicious positive jumps. Requires ≥3 data points before any trajectory adjustment is applied.

07 · Trajectory Lag Bug Fixed (High Impact)

Problem

When a new score was submitted, the trajectory analysis received only the historical scores — excluding the score just computed. This meant trajectory analysis was always one observation behind, missing the most recent and most relevant data point.

Correction

The newly computed score is now included in the trajectory history array before analysis. Additionally, database queries were corrected to reference the scored_at column (the actual column name) rather than created_at (which did not exist in the table schema).

10 · Confidence Assessment of Corrections

Not all corrections carry equal certainty. Some are mathematically provable improvements. Others are well-grounded in statistical theory but will require empirical validation with production data. We assess each correction across three dimensions: theoretical soundness, expected impact magnitude, and remaining uncertainty.

Correction · Theoretical Confidence · Expected Impact · Remaining Uncertainty

Exponential recency decay · Very High · High · Optimal half-life (45 days) may need tuning per course duration. Semester-length courses may benefit from a shorter half-life than year-long sequences.

Source normalization · Very High · High · Minimal. Two-phase normalization is mathematically correct for preventing count imbalance. Source reliability weights may need calibration with empirical data.

Diminishing confidence · High · Medium · The geometric model is well-established. The per-observation rates (0.15 for quiz, 0.05 for coaching) are reasonable priors that should be validated against observed scoring consistency.

Adaptive anomaly detection · High · Medium · The 2σ threshold is standard in anomaly detection. With very few data points (n < 4), the estimated σ may be unreliable, which is why a 20-point minimum floor is enforced.

Standard-error CIs · Very High · Medium · The t-distribution is the correct theoretical framework for small-sample intervals. The only assumption is approximate normality of scores, which is reasonable for aggregated proficiency estimates.

Genuineness-scaled adjustments · Moderate · Medium · The 0.3 momentum coefficient is an informed prior but has not been empirically optimized. The genuineness score itself combines three factors with equal implicit weighting that may benefit from calibration.

Trajectory lag fix · Certain · High · None. This was a bug: the most recent score was excluded from trajectory analysis. The fix is definitionally correct.

Five of seven corrections are grounded in established statistical methods with very high theoretical confidence. One is a definitive bug fix. The remaining correction (genuineness-scaled adjustments) is the most heuristic and will be the first candidate for empirical refinement as production data accumulates.

11 · Future Directions with More Data

The current engine is designed around strong priors and conservative defaults. As Arrival accumulates institutional data across pilot deployments, several components will evolve from informed estimates to empirically calibrated parameters. The following represent our highest-priority research directions.

Empirically Calibrated Source Weights

After 2–3 pilot semesters

Current source weights (quiz: 1.0, upload: 0.75, coaching: 0.35) are informed priors. With sufficient paired data — where students have scores from multiple source types for the same skill — we can compute the actual predictive validity of each source against end-of-course outcomes. If work uploads prove more predictive than quizzes for certain skill types, the weights will reflect that.

Course-Adaptive Recency Half-Life

After 1–2 pilot semesters

The 45-day half-life is a reasonable default for a standard 15-week semester. However, intensive summer courses, year-long sequences, and modular programs have different temporal dynamics. With cross-course data, we can fit optimal half-lives per course structure or allow the engine to learn the decay rate from observed score trajectories.

Item Response Theory (IRT) Integration

After 5+ pilot semesters

The current model treats all quiz questions as equally informative. IRT models each question's difficulty and discrimination, producing more precise proficiency estimates per question answered. This requires a large item bank with response data across many students. As our question bank grows through adaptive assessment generation, IRT integration becomes increasingly feasible and valuable.

Bayesian Knowledge Tracing (BKT)

After 3–5 pilot semesters

BKT models proficiency as a hidden state that evolves with each learning event, estimating the probability that a student has truly learned a skill versus performing correctly by chance. It naturally handles the "lucky guess" and "careless mistake" failure modes. Integrating BKT would replace the current trajectory adjustment heuristic with a principled probabilistic model of learning dynamics.

Multi-Dimensional Fairness Monitoring

After 3+ pilot semesters across institutions

Currently fairness is monitored using enrollment cohort as a proxy grouping. With data across multiple institutions, we can test for scoring equity across institution type (community college vs. research university), course format (online vs. in-person), and engagement pattern clusters — without ever collecting protected demographic attributes.

Genuineness Model Calibration

After 1–2 pilot semesters

The genuineness score currently combines consistency, velocity, and anomaly penalty with implicit equal weighting. With outcome data (did students with high genuineness scores actually retain their gains?), we can learn the optimal combination weights and potentially add new input signals such as time-between-assessments and practice frequency.

Predictive Confidence Intervals

After 2–3 pilot semesters

Current intervals describe uncertainty about the present estimate. With longitudinal data, we can produce predictive intervals: "based on students with similar trajectories, this student's score is likely to be between X and Y by end of term." This transforms the CI from a retrospective measurement into a forward-looking planning tool for faculty.

The Design Principle Behind Every Future Change

Every parameter in the current engine has been set to a defensible default. No parameter has been set by convenience. As data accumulates, each default can be replaced with an empirically learned value — but only when the evidence justifies the change with confidence exceeding the prior. This is not a system that will drift. It is a system designed to sharpen.

12 · Conclusion

Arrival's proficiency engine is not a black box that assigns numbers. It is a transparent, governed pipeline where every score can be traced to its evidence sources, every estimate carries an honest confidence interval, and every automated decision is conditional on satisfying fairness, explainability, and uncertainty constraints.

The recent corrections addressed fundamental accuracy issues: stale evidence contaminating current estimates, source count imbalances distorting the signal, false confidence from narrow evidence, and a trajectory analysis that was systematically one observation behind. Five of seven corrections are grounded in well-established statistical methods. One was a definitive bug fix. The remaining heuristic component is clearly identified and scheduled for empirical calibration.

We do not claim this system is perfect. We claim it is honest about what it knows, transparent about how it knows it, and designed to get sharper with every semester of data it encounters. That is the standard faculty should expect from any system that presumes to measure what their students know.
