White Paper · Proficiency Scoring Methodology

How Arrival Measures What Students Know

A detailed methodology paper on Arrival's multi-source, trajectory-aware proficiency scoring engine — how it works, what corrections have been made, our confidence in those corrections, and what will change as more data becomes available.

Evidence Sources: Quizzes, Assessments, Work Uploads, Coaching
Recency Model: Exponential Decay (45-Day Half-Life)
Anomaly Detection: Adaptive 2σ Threshold
Confidence Intervals: Standard Error with t-Correction

1 · Executive Summary

Arrival's proficiency engine converts raw evidence — quiz scores, assessment results, work uploads, and coaching interactions — into a single, defensible proficiency estimate for every student on every skill. That estimate is not a simple average. It is the output of a multi-layered pipeline that accounts for the reliability of each evidence source, the recency of each observation, the trajectory of a student's learning over time, and the statistical confidence we can place in the result.

Every score passes through a governance layer that enforces fairness thresholds, explainability requirements, and uncertainty constraints before it is surfaced to faculty or used to inform interventions. The system cannot act on uncertain predictions. When any constraint fails, it defers to human review.

The goal is not to assign a number. It is to produce a proficiency estimate that a faculty member can trust enough to act on — and that a student can trust enough to learn from.

This paper describes each stage of the pipeline, the recent corrections made to improve accuracy, the statistical confidence behind those corrections, and the directions we expect the model to evolve as more institutional data becomes available.

2 · The Problem with Simple Averages

Traditional grading systems compute a mean or weighted mean of assignment scores. This approach has three fundamental accuracy problems that compound as a course progresses:

Source Blindness

A quiz answer and a coaching session exchange are treated with equal evidentiary weight. In reality, a controlled quiz provides far stronger evidence of mastery than an engagement signal from a coaching conversation. Simple averages cannot distinguish between these.

Temporal Flatness

A score from the first week of the semester carries the same weight as one from yesterday. For a student who has been improving steadily, this drags the estimate down. For a student who has been declining, it inflates the estimate. The average lags the truth in both directions.

False Precision

A student scored on two quiz questions receives a number that looks identical in format to a student scored on twenty. The average provides no mechanism to distinguish between “we're fairly certain” and “we're guessing.” Without confidence intervals, faculty cannot calibrate the urgency of an intervention.

Arrival's proficiency engine was built to solve all three problems simultaneously. Every score carries source weights, recency decay, and a confidence interval that tells faculty not just where a student stands — but how sure we are about it.

3 · Multi-Source Evidence Architecture

The engine accepts evidence from four distinct source types, each assigned a reliability weight reflecting how strongly a single observation from that source indicates true mastery:

Quiz · weight 1.00
Controlled assessment environment. Questions are calibrated to specific skills. Highest evidentiary reliability.

Assessment · weight 1.00
Initial onboarding assessment. Broader in scope, establishes a baseline across all course skills.

Work Upload · weight 0.75
Student-submitted work analyzed by AI. Rich evidence of applied mastery, but noisier due to open-ended format.

Coaching · weight 0.35
Signals derived from coaching interactions. Engagement is a weak proxy for mastery. Lowest reliability.

Source-Normalized Weighting

A critical design decision is source normalization. Rather than weighting each individual observation independently (which allows high-volume, low-reliability sources to overwhelm the estimate), the engine first computes a recency-weighted average within each source, then combines those per-source averages by reliability weight.

Phase 1: Within-Source Averaging

For each source type, all observations are combined into a single recency-weighted average. If a student has ten coaching signals and one quiz score, the coaching signals produce one aggregate number — not ten individual weights competing against a single quiz.

Phase 2: Cross-Source Combination

The per-source averages are then combined using reliability weights. The quiz average (weight 1.0) and the coaching average (weight 0.35) are brought together proportionally. This ensures the final score reflects the quality of evidence, not just the quantity.

One strong quiz is worth more than ten coaching signals — and the algorithm ensures the math reflects that.
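For concreteness, the two-phase computation can be sketched in a few lines of Python. The source weights and the 45-day half-life are those described in this paper; the function shape and data format are illustrative, not the production implementation.

```python
# Sketch of two-phase, source-normalized weighting. Source reliability
# weights and the 45-day half-life are from the paper; everything else
# (names, data shapes) is hypothetical.
from collections import defaultdict

SOURCE_WEIGHTS = {"quiz": 1.00, "assessment": 1.00,
                  "work_upload": 0.75, "coaching": 0.35}
HALF_LIFE_DAYS = 45.0

def recency_weight(age_days: float) -> float:
    """Exponential decay: half weight every 45 days."""
    return 2 ** (-age_days / HALF_LIFE_DAYS)

def proficiency(observations):
    """observations: list of (source, score_0_100, age_days) tuples."""
    # Phase 1: recency-weighted average within each source.
    per_source = defaultdict(lambda: [0.0, 0.0])  # source -> [w*score sum, w sum]
    for source, score, age in observations:
        w = recency_weight(age)
        per_source[source][0] += w * score
        per_source[source][1] += w
    # Phase 2: combine per-source averages by reliability weight.
    num = den = 0.0
    for source, (weighted_sum, weight_sum) in per_source.items():
        num += SOURCE_WEIGHTS[source] * (weighted_sum / weight_sum)
        den += SOURCE_WEIGHTS[source]
    return num / den

# Ten high coaching signals cannot outvote one weak quiz score:
obs = [("coaching", 95, 1)] * 10 + [("quiz", 60, 1)]
print(round(proficiency(obs), 1))
```

Note that the ten coaching observations collapse to a single aggregate before the reliability weights are applied, which is exactly what prevents count imbalance.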

4 · Exponential Recency Model

Learning is temporal. A student's proficiency today is better reflected by recent evidence than by observations from the start of the semester. The engine applies exponential recency decay with a 45-day half-life to every observation.

How It Works

Each observation's weight is multiplied by 2^(−t/45), where t is the age of the observation in days. An observation from today has full weight. An observation from 45 days ago has half weight. An observation from 90 days ago has one-quarter weight.

This is a continuous, smooth decay — not a hard cutoff. Old observations still contribute; they simply cannot dominate the estimate when fresher evidence exists.

Why Exponential, Not Linear

The previous model used a linear decay function that flattened after 70 days, treating a 70-day-old score identically to a 700-day-old one. This created a “stale floor” where old evidence accumulated permanent influence regardless of age.

Exponential decay mirrors how learning actually works: recent performance is the strongest indicator of current capability, and the influence of older evidence fades naturally without artificial thresholds.

Recency Weight by Observation Age

Today: 100% · 2 Weeks: 80% · 45 Days: 50% · 90 Days: 25% · 6 Months: 6%
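The weights tabulated above follow directly from the decay formula; a minimal check in Python:

```python
# Reproduce the recency-weight table from w(t) = 2^(-t / 45).
# The tabulated values are rounded to whole percentages.
for days, label in [(0, "Today"), (14, "2 Weeks"), (45, "45 Days"),
                    (90, "90 Days"), (180, "6 Months")]:
    weight = 2 ** (-days / 45)
    print(f"{label:>9}: {weight:.1%}")
```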

5 · Trajectory Analysis

A point estimate tells you where a student is. Trajectory analysis tells you where they are going. The engine analyzes the sequence of scores over time to compute velocity, consistency, and a genuineness score that distinguishes real learning from noise.

Exponentially-Weighted Velocity

Rather than computing a simple average of all score changes, the engine weights recent changes more heavily (decay factor: 0.7 per step). A student who was declining but has recently turned around will show a positive velocity — the simple average would still show stagnation.
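A sketch of the weighting, assuming (as the text states) a 0.7 discount per step back in time; the function name and history format are illustrative.

```python
# Exponentially-weighted velocity over a student's chronological scores.
# The 0.7-per-step discount is from the paper; the rest is a sketch.
def velocity(scores):
    """scores: chronological proficiency scores for one skill."""
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    if not deltas:
        return 0.0
    num = den = 0.0
    for steps_back, delta in enumerate(reversed(deltas)):
        w = 0.7 ** steps_back       # most recent change carries full weight
        num += w * delta
        den += w
    return num / den

# A decline followed by a recent turnaround: the weighted velocity is
# clearly positive, while the simple mean of the same deltas is negative.
history = [70, 60, 50, 58, 68]
print(velocity(history))
print(sum(b - a for a, b in zip(history, history[1:])) / 4)
```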

Adaptive Anomaly Detection

Instead of flagging any jump greater than a fixed 30-point threshold, the engine computes the standard deviation of the student's own score deltas and flags jumps beyond 2σ (with a minimum floor of 20 points). This adapts to each student's natural variability.
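The adaptive threshold can be sketched with the standard library. Whether the engine uses population or sample standard deviation, and how it handles very short histories, are assumptions here.

```python
# Adaptive anomaly threshold: 2 sigma of the student's own score deltas,
# with a 20-point minimum floor (both values from the paper).
import statistics

def anomaly_threshold(scores, floor=20.0):
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    if len(deltas) < 2:
        return floor                 # too little history: fall back to the floor
    return max(floor, 2 * statistics.pstdev(deltas))

def is_anomalous_jump(scores, new_score):
    return abs(new_score - scores[-1]) > anomaly_threshold(scores)

# A volatile student tolerates bigger swings than a steady one:
steady = [70, 71, 69, 72, 70]
volatile = [40, 85, 55, 90, 50]
print(anomaly_threshold(steady), anomaly_threshold(volatile))
```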

Continuous Genuineness Score

Learning genuineness is no longer a binary yes/no. It is a continuous 0–1 score combining directional consistency, velocity magnitude, and an anomaly penalty. A genuineness of 0.8 means strong, sustained learning. A score of 0.2 means noisy or suspicious improvement.

Trend Classification

Accelerating (> +5 pts/assessment): Strong upward trajectory with sustained, genuine improvement.

Improving (+1 to +5 pts): Steady gains. Consistency determines whether progress appears sustained or variable.

Stable (−1 to +1 pts): Scores are flat. No significant growth detected. May indicate a plateau or need for intervention.

Slipping (−5 to −1 pts): Moderate decline. Early intervention recommended before the trend deepens.

Declining (< −5 pts): Significant, sustained loss of proficiency. Immediate support needed.

Score Adjustment

When the trajectory is both genuine (genuineness ≥ 0.4) and backed by at least three data points, the engine applies a small momentum adjustment to the weighted score. The adjustment is proportional to both the velocity and the genuineness confidence:

Momentum = genuineness × velocity × 0.3

A student with genuineness 0.8 and velocity +6 receives a momentum boost of +1.4 points. A student with genuineness 0.3 (below the threshold) receives no adjustment regardless of velocity. This prevents lucky guesses or noisy spikes from inflating scores.

Conversely, suspicious positive anomalies trigger proportional dampening. A 30-point unexplained jump dampens the score by approximately 3%. A 60-point jump dampens by approximately 6%, capped at 10%. Only suspicious positive jumps are dampened — legitimate score drops are not penalized further.
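Both rules can be summarized in a short sketch. The momentum formula, the 0.4 genuineness gate, and the three-point minimum come from the text; the 0.1%-per-point dampening slope is inferred from the 30-point / 3% and 60-point / 6% examples and should be read as an assumption.

```python
# Momentum boost and anomaly dampening, per the rules above.
def momentum_adjustment(genuineness, velocity, n_points):
    # Gate: at least 3 data points and genuineness >= 0.4 (from the paper).
    if n_points < 3 or genuineness < 0.4:
        return 0.0
    return genuineness * velocity * 0.3

def dampening_factor(jump_size, suspicious_positive):
    if not suspicious_positive:
        return 0.0                   # legitimate drops are never penalized
    # Assumed slope: ~0.1% per point of unexplained upward jump, capped at 10%.
    return min(0.10, jump_size * 0.001)

print(momentum_adjustment(0.8, 6, 5))   # the worked example above: ~+1.4
print(dampening_factor(60, True))       # ~6% dampening, capped at 10%
```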

6 · Confidence Intervals

Instead of “you're at 72%,” Arrival says “you're at 72% (65–79).” Confidence intervals communicate the precision of the estimate and enable faculty to make appropriately calibrated decisions.

Standard Error with t-Correction

The interval width is derived from the standard error of the mean of all observations, multiplied by the appropriate t-value for a 90% confidence interval. For small samples (n ≤ 10), we use the Student's t-distribution rather than the normal approximation, which correctly produces wider intervals when data is sparse.

With one observation, the t-value is 6.314 — producing a wide interval that honestly communicates uncertainty. By ten observations, it narrows to 1.812, approaching the large-sample z-value of 1.645.
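A sketch of the interval computation. The t-values are the paper's stated figures for a 90% interval; how the engine indexes them at the smallest sample sizes, and what it does with a single observation (where no sample variance exists), are assumptions here.

```python
# 90% interval: standard error of the mean with small-sample t-correction.
import statistics

T90 = {1: 6.314, 2: 2.920, 3: 2.353, 4: 2.132, 5: 2.015,
       6: 1.943, 7: 1.895, 8: 1.860, 9: 1.833, 10: 1.812}

def confidence_interval(scores, estimate):
    n = len(scores)
    if n < 2:
        return (0.0, 100.0)          # one observation: maximal uncertainty (assumed)
    se = statistics.stdev(scores) / n ** 0.5
    t = T90.get(n, 1.645)            # large-sample z beyond n = 10
    half = t * se
    return (max(0.0, estimate - half), min(100.0, estimate + half))

lo, hi = confidence_interval([68, 74, 71, 76, 70], 72)
print(f"72 ({lo:.0f}-{hi:.0f})")
```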

Source Quality Adjustment

The interval width is further adjusted by the proportion of high-reliability sources (quizzes and formal assessments) in the observation set. When 100% of evidence comes from controlled assessments, the interval narrows by up to 15%.

This reflects the practical reality that quiz scores have lower measurement error than coaching signals, independent of sample size.

Confidence Accumulation

Overall confidence (0–1) accumulates with diminishing returns. Each additional observation from a source adds less confidence than the previous one, following a geometric model:

Source Confidence = 1 − (1 − rate)^n, where rate is the per-observation confidence contribution and n is the number of observations.

The first quiz contributes 15% toward full confidence. The second adds 12.75% (15% of the remaining 85%). The fifth adds about 7.8%. This prevents the system from reaching false certainty on a narrow evidence base, while still rewarding breadth of evidence.
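The accumulation rule is compact enough to verify directly; the 0.15 quiz rate is from the text, and the marginal gains below follow from the formula.

```python
# Geometric confidence accumulation: each new observation claims a fixed
# fraction of the remaining distance to full confidence.
def source_confidence(rate, n):
    return 1 - (1 - rate) ** n

# Marginal contribution of each successive quiz (rate = 0.15):
for n in (1, 2, 5, 10):
    gain = source_confidence(0.15, n) - source_confidence(0.15, n - 1)
    print(f"quiz #{n} adds {gain:.1%}")
```

The marginal gains decline geometrically (roughly 15%, then 12.75%, 7.8%, 3.5%), so no single source can rush the system to certainty.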

Reliability Labels

High: Confidence ≥ 80%, interval width ≤ 10 pts
Moderate: Confidence ≥ 50%, interval width ≤ 18 pts
Low: Below moderate thresholds

7 · Governance and Fairness

Every proficiency score passes through a four-layer governance pipeline before it is persisted or surfaced to users. The system operates under the constraint that automated action is conditional, not assumed. When any constraint fails, the system defers to human review.

L0 · Input Validation

Verifies consent, provenance, and data integrity. Only necessary fields are propagated forward (feature minimization). A score without a verifiable student identity and skill mapping is rejected before computation begins.

L1 · Score Audit

Requires every score to include an evidence summary (explainability constraint). Warns when confidence falls below the threshold. A score that cannot explain itself is flagged as inadmissible.

L2 · Uncertainty Estimation

Computes the variance of the student's score distribution and checks it against the uncertainty threshold (δ = 0.25). Prevents the system from acting on unstable or overconfident predictions.

L3 · Decision Governance

Evaluates all constraints in hierarchy: Privacy ≻ Fairness ≻ Risk ≻ Confidence. Automated action requires all constraints to pass. Insufficient assessment history always defers to human judgment.

Fairness Monitoring

By design, Arrival does not collect race, gender, or demographic data. Fairness is monitored using proxy groupings — enrollment cohort, institution, and engagement level — to detect whether AI scoring produces inequitable score distributions. The fairness metric is the difference in pass rates (score ≥ 70) between groups, held to an ε = 0.10 threshold. A deviation exceeding 10% triggers an automatic flag for human review.
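The fairness metric itself is simple; a sketch with the paper's ε = 0.10 threshold and 70-point pass mark (the group data is, of course, hypothetical):

```python
# Fairness check: difference in pass rates (score >= 70) between proxy
# groups, flagged when it exceeds epsilon = 0.10 (both from the paper).
EPSILON = 0.10
PASS_MARK = 70

def pass_rate(scores):
    return sum(s >= PASS_MARK for s in scores) / len(scores)

def fairness_flag(group_a, group_b):
    gap = abs(pass_rate(group_a) - pass_rate(group_b))
    return gap, gap > EPSILON

gap, flagged = fairness_flag([72, 81, 65, 90, 74], [68, 71, 66, 69, 64])
print(f"pass-rate gap {gap:.0%}, flagged: {flagged}")
```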

L4: Continuous Monitoring evaluates drift, anomalies, and fairness deviations across all scored assessments. Score changes exceeding 40 points between consecutive assessments for the same skill are flagged as anomalies for review, regardless of whether the governance pipeline approved the individual score.

8 · AI Assessment Scoring

The raw skill scores that feed the proficiency engine are generated by Anthropic's Claude, operating under tightly constrained system prompts that enforce evidence-based scoring discipline.

Quiz & Assessment Scoring

Claude receives the questions, correct answers, the student's responses, and the skill definitions for the course. It scores each skill on a 0–100 rubric, provides a confidence rating based on how much evidence the questions provided, and writes an evidence summary citing specific answer quality.

0–30: Little to no demonstrated understanding
31–50: Partial understanding, significant gaps
51–70: Foundational competency, some gaps
71–85: Solid understanding, minor gaps
86–100: Strong mastery demonstrated

Work Upload Assessment

For student-submitted work (essays, projects, problem sets), Claude analyzes the submission against all course skills, returning a score only for skills where the work provides direct evidence. Skills without evidence receive null rather than a zero — ensuring the absence of evidence is not treated as evidence of absence.

The system prompt enforces rigor: “High scores require clear demonstration, not just mention.” Every assessment includes specific citations from the student's work to maintain full traceability.

Claude's raw scores are the starting point, not the final answer. They enter the multi-source weighting pipeline as one observation among many, subject to the same recency decay, source normalization, trajectory analysis, and governance constraints as every other evidence source.

9 · Recent Corrections and Improvements

The proficiency engine has undergone seven targeted corrections to improve scoring accuracy. Each correction addresses a specific, identified failure mode in the previous implementation.

01 · Linear Recency Decay Replaced with Exponential Decay (High Impact)

Problem

The previous linear decay function (1 - ageDays × 0.01) produced a flat minimum weight of 0.3 after 70 days. A score from 70 days ago had the same influence as a score from 700 days ago, creating permanent contamination of the estimate by stale evidence.

Correction

Replaced with exponential decay using a 45-day half-life: 2^(-t/45). Influence decays smoothly and continuously. A 6-month-old score carries only 6% weight, naturally fading without an arbitrary cutoff.

02 · Source Count Imbalance Eliminated via Normalization (High Impact)

Problem

Each observation was weighted independently. A student with 10 coaching signals and 1 quiz saw coaching dominate the final score despite coaching's low reliability weight (0.35), because raw accumulation overwhelmed the source weight multiplier.

Correction

Introduced two-phase computation: within-source recency-weighted average first, then cross-source combination by reliability weight. Each source now contributes one aggregate signal, regardless of how many individual observations it contains.

03 · Linear Confidence Replaced with Diminishing Returns (Medium Impact)

Problem

Confidence accumulated linearly: 0.15 per quiz observation, uncapped until hitting 1.0. The 10th quiz added the same 15% as the 1st, and 7 quizzes alone could reach full confidence — creating false certainty on a narrow evidence base.

Correction

Switched to geometric accumulation: 1 - (1 - rate)^n. The first quiz adds 15%. The fifth adds about 7.8%. The tenth adds about 3.5%. Full confidence now requires either extensive data from a single reliable source or corroborating evidence across multiple source types.

04 · Fixed Anomaly Threshold Replaced with Adaptive Detection (Medium Impact)

Problem

Any score change exceeding 30 points was flagged as anomalous. For a student with naturally high variability (e.g., strong quiz scores but weak uploads), normal fluctuations triggered false anomaly flags. For a student with very consistent scores, a suspicious 25-point jump went undetected.

Correction

Anomaly threshold is now computed per-student as 2σ from their mean delta, with a 20-point minimum floor. The system adapts to each student's natural variability rather than applying a universal cutoff.

05 · Heuristic Confidence Intervals Replaced with Statistical CIs (Medium Impact)

Problem

The previous interval used an arbitrary base half-width of 25 points that narrowed by a hand-tuned formula. The resulting intervals bore no statistical relationship to the actual variance of the observed data.

Correction

Intervals are now computed from the standard error of the mean, multiplied by the appropriate t-value for the sample size. For n=1, the wide t-value (6.314) produces an honest interval. For n=10+, the interval narrows toward the normal approximation. Source quality provides an additional continuous narrowing factor.

06 · Trajectory Adjustments Scaled by Genuineness Confidence (Medium Impact)

Problem

The previous model applied a flat velocity × 0.5 boost whenever isGenuine was true and velocity exceeded 3. Anomaly dampening was a blanket 5% applied to all anomalies regardless of direction or severity. This produced both over-boosting of noisy upward trends and unnecessary penalization of legitimate score drops.

Correction

The boost is now genuineness × velocity × 0.3, where genuineness is a continuous 0–1 score. Dampening is proportional to anomaly severity and applied only to suspicious positive jumps. Requires ≥3 data points before any trajectory adjustment is applied.

07 · Trajectory Lag Bug Fixed (High Impact)

Problem

When a new score was submitted, the trajectory analysis received only the historical scores — excluding the score just computed. This meant trajectory analysis was always one observation behind, missing the most recent and most relevant data point.

Correction

The newly computed score is now included in the trajectory history array before analysis. Additionally, database queries were corrected to reference the scored_at column (the actual column name) rather than created_at (which did not exist in the table schema).

10 · Confidence Assessment of Corrections

Not all corrections carry equal certainty. Some are mathematically provable improvements. Others are well-grounded in statistical theory but will require empirical validation with production data. We assess each correction across three dimensions: theoretical soundness, expected impact magnitude, and remaining uncertainty.

Correction · Theoretical Confidence · Expected Impact · Remaining Uncertainty

Exponential recency decay · Very High · High · Optimal half-life (45 days) may need tuning per course duration. Semester-length courses may benefit from a shorter half-life than year-long sequences.

Source normalization · Very High · High · Minimal. Two-phase normalization is mathematically correct for preventing count imbalance. Source reliability weights may need calibration with empirical data.

Diminishing confidence · High · Medium · The geometric model is well-established. The per-observation rates (0.15 for quiz, 0.05 for coaching) are reasonable priors that should be validated against observed scoring consistency.

Adaptive anomaly detection · High · Medium · The 2σ threshold is standard in anomaly detection. With very few data points (n < 4), the estimated σ may be unreliable, which is why a 20-point minimum floor is enforced.

Standard-error CIs · Very High · Medium · The t-distribution is the correct theoretical framework for small-sample intervals. The only assumption is approximate normality of scores, which is reasonable for aggregated proficiency estimates.

Genuineness-scaled adjustments · Moderate · Medium · The 0.3 momentum coefficient is an informed prior but has not been empirically optimized. The genuineness score itself combines three factors with equal implicit weighting that may benefit from calibration.

Trajectory lag fix · Certain · High · None. This was a bug: the most recent score was excluded from trajectory analysis. The fix is definitionally correct.

Five of seven corrections are grounded in established statistical methods with very high theoretical confidence. One is a definitive bug fix. The remaining correction (genuineness-scaled adjustments) is the most heuristic and will be the first candidate for empirical refinement as production data accumulates.

11 · Future Directions with More Data

The current engine is designed around strong priors and conservative defaults. As Arrival accumulates institutional data across pilot deployments, several components will evolve from informed estimates to empirically calibrated parameters. The following represent our highest-priority research directions.

Empirically Calibrated Source Weights

After 2–3 pilot semesters

Current source weights (quiz: 1.0, upload: 0.75, coaching: 0.35) are informed priors. With sufficient paired data — where students have scores from multiple source types for the same skill — we can compute the actual predictive validity of each source against end-of-course outcomes. If work uploads prove more predictive than quizzes for certain skill types, the weights will reflect that.

Course-Adaptive Recency Half-Life

After 1–2 pilot semesters

The 45-day half-life is a reasonable default for a standard 15-week semester. However, intensive summer courses, year-long sequences, and modular programs have different temporal dynamics. With cross-course data, we can fit optimal half-lives per course structure or allow the engine to learn the decay rate from observed score trajectories.

Item Response Theory (IRT) Integration

After 5+ pilot semesters

The current model treats all quiz questions as equally informative. IRT models each question's difficulty and discrimination, producing more precise proficiency estimates per question answered. This requires a large item bank with response data across many students. As our question bank grows through adaptive assessment generation, IRT integration becomes increasingly feasible and valuable.

Bayesian Knowledge Tracing (BKT)

After 3–5 pilot semesters

BKT models proficiency as a hidden state that evolves with each learning event, estimating the probability that a student has truly learned a skill versus performing correctly by chance. It naturally handles the "lucky guess" and "careless mistake" failure modes. Integrating BKT would replace the current trajectory adjustment heuristic with a principled probabilistic model of learning dynamics.

Multi-Dimensional Fairness Monitoring

After 3+ pilot semesters across institutions

Currently fairness is monitored using enrollment cohort as a proxy grouping. With data across multiple institutions, we can test for scoring equity across institution type (community college vs. research university), course format (online vs. in-person), and engagement pattern clusters — without ever collecting protected demographic attributes.

Genuineness Model Calibration

After 1–2 pilot semesters

The genuineness score currently combines consistency, velocity, and anomaly penalty with implicit equal weighting. With outcome data (did students with high genuineness scores actually retain their gains?), we can learn the optimal combination weights and potentially add new input signals such as time-between-assessments and practice frequency.

Predictive Confidence Intervals

After 2–3 pilot semesters

Current intervals describe uncertainty about the present estimate. With longitudinal data, we can produce predictive intervals: "based on students with similar trajectories, this student's score is likely to be between X and Y by end of term." This transforms the CI from a retrospective measurement into a forward-looking planning tool for faculty.

The Design Principle Behind Every Future Change

Every parameter in the current engine has been set to a defensible default. No parameter has been set by convenience. As data accumulates, each default can be replaced with an empirically learned value — but only when the evidence justifies the change with confidence exceeding the prior. This is not a system that will drift. It is a system designed to sharpen.

12 · Conclusion

Arrival's proficiency engine is not a black box that assigns numbers. It is a transparent, governed pipeline where every score can be traced to its evidence sources, every estimate carries an honest confidence interval, and every automated decision is conditional on satisfying fairness, explainability, and uncertainty constraints.

The recent corrections addressed fundamental accuracy issues: stale evidence contaminating current estimates, source count imbalances distorting the signal, false confidence from narrow evidence, and a trajectory analysis that was systematically one observation behind. Five of seven corrections are grounded in well-established statistical methods. One was a definitive bug fix. The remaining heuristic component is clearly identified and scheduled for empirical calibration.

We do not claim this system is perfect. We claim it is honest about what it knows, transparent about how it knows it, and designed to get sharper with every semester of data it encounters. That is the standard faculty should expect from any system that presumes to measure what their students know.
