4. Score normalization technical details We now discuss the technical details of the score normalization method.

Size: px

Start display at page:

Download "4. Score normalization technical details We now discuss the technical details of the score normalization method."

Darrell Jason Norris
6 years ago
Views:

1 SMT SCORING SYSTEM This document describes the scoring system for the Stanford Math Tournament We begin by giving an overview of the changes to scoring and a non-technical descrition of the scoring rules Then, we give a recise technical descrition of the scoring system 1 Summary of scoring changes We will normalize the scores on each test by a new method based on simultaneously estimating the relative erformance of each contestant and the difficulty of each roblem from the actual results of the tournament This changes the way overall team scores are comuted to make them fairer, by making scores on different tests more comarable However, it does not affect the rankings on each searate test For test-taking strategy, the imortant oint is that your score on each test still deends only on the number of roblems you solve correctly So the goal remains simle: solve as many roblems as you can Scoring rules On the individual tests and the Team Test, each roblem is worth one oint towards the raw score (See the following sections for why these are the raw scores) On the Power Round, roblems are weighted, and the oint value of each roblem towards the score will be indicated on the test While scoring of each test is straightforward, determining how much each test should contribute to a team s overall score is a more comlicated roblem For each test excet the Power Round, the scores are normalized by the score normalization rocedure described in the following sections This converts each raw score to a normalized score Each normalized score is then multilied by a weight deending on the test, to roduce a weighted score The weights determine how much each test is worth comared to the others The weight for the individual tests (both general and subject tests) is 50, and the weight for the team test and ower round is 400 The weighted scores on the Power round are simly the raw scores, divided by the maximum score ossible score, and multilied by the weight of 400 A team s overall score is the sum of the weighted scores on the team test, the ower round, and all of the team members individual tests So, individual tests count for about 50% of a team s overall score, and the team and ower tests each count for about 5% Note that the easier general test is worth aroximately 50% as much as two subject tests 3 Overview of score normalization 31 Motivation Our scoring needs to roduce two end results: a ranking of the cometitors (individuals or teams) on each test, and an overall score for each team, which is a sum of contributions from the tests the team took We therefore wish to assign a score to each cometitor on each test which reresents their relative erformance on that test These 1

2 scores are the contributions to the overall team scores, and also determine the rankings on each test Because the overall team scores are comosed of contributions from several different tests, the scores need to be normalized to be comarable across tests Using the number of roblems solved correctly or a sum of oints awarded for each roblem as the score is clearly not a good choice The most obvious examle is that individuals can take different subject tests or the general test, which vary in difficulty Without any normalization, there would be a bias in favor of easier subject tests 3 Score normalization method The basic idea of our method for normalizing scores on each test is that we simultaneously estimate the relative erformance of each contestant and the difficulty of each roblem based on the actual results on the test We then estimate the number of roblems each contestant would solve on an idealized model test from the estimate of their relative erformance This is the contestant s normalized score Note that scores on each test are normalized indeendently of the other tests A little more technically, we model the robability of a correct/incorrect resonse by each cometitor on each roblem as a function of arameters for the cometitor s ability and the roblem s difficulty We estimate these arameters by finding the values that best fit the results of the tournament We then consider a model test where the roblem difficulties are distributed according to a fixed robability distribution Using the estimates of a cometitor s ability, we comute the robability of the cometitor solving a roblem on this test This exected score is the normalized score we use The overall team scores are sums of these scores, and this score normalization method yields scores that are more comarable between tests and more fair as contributions to the overall team scores One roerty of this model is that a cometitor s normalized score deends only on the number of roblems they solved correctly, not on which roblems they solved (This follows from the technical details, as described in the following sections) Of course, the normalized score is higher for cometitors who solve more roblems So, desite the comlexity of the scoring system, the goal of each cometitor is still simly to solve as many roblems as they can Also, the rankings of cometitors on each test simly deend on the number of roblems they solved correctly 33 Advantages In revious years, we normalized scores by dividing the scores on each test by the to ten average This does hel comared to just using raw, unnormalized scores, but these scores are still not very fair For instance, this does not normalize for differences in variance Consider a simle examle: if two tests have the same to ten average but different variances, scores on the high-variance test become more imortant for overall team scores Another flaw with this normalization scheme is that it focuses on normalizing scores for the to ten cometitors, but does little to ensure that scores for cometitors in other score brackets are accurately normalized The new scoring method hels resolve these flaws and other issues It takes into account asects of the score distribution other than just the to ten average, roducing scores that are accurately normalized in all score brackets The only real disadvantage of the new scoring system is the greatly increased comlexity involved in calculating overall team scores However, this comlexity has little direct imact

3 on contestants, because the goal when taking a test remains essentially unchanged believe that the imrovements in the fairness of scoring outweigh the loss in simlicity We 4 Score normalization technical details We now discuss the technical details of the score normalization method 41 Model To normalize scores on a test, we model the robability of the resonse of each cometitor on each roblem Our model has the following arameters: the ability score α c of each cometitor c and the difficulty δ of each roblem We model the robability that a cometitor with a given ability score α c correctly solves a roblem with a given difficulty δ as g(α c δ ), where g(x) = 1-arameter item resonse theory model) We will comute the MAP estimate, and assign each cometitor s score based on the MAP estimate of their ability score α c For each cometitor c, α c is a riori distributed as an (imroer) uniform distribution over R (or, equivalently, there is no rior) The roblem difficulties δ are a riori distributed as ex is a logistic function (This is a Rasch model, or a 1+e x P ( δ) = ex ( (δ ) δ) over R (with aroriate normalization), where δ is the mean of the δ, and σ δ = 5 This is a Gaussian around δ for each roblem We assume that the observed results on each roblem by each cometitor are conditionally indeendent of each other given the arameters Thus, the likelihood function is P (D α, δ) = g(α c δ ) (1 g(α c δ )) 4 Parameter estimation We wish to maximize the osterior robability P ( α, δ D) = P (D α, δ)p ( α, δ) P (D) Droing the constant P (D) and normalization constants, this is equivalent to maximizing g(α c δ ) ex ( (δ ) δ) (1 g(α c δ )) This is equivalent to maximizing its logarithm log g(α c δ ) + log (1 g(α c δ )) 1 ( δ δ ) σδ For convenience, let l( α, δ) = be the log-likelihood log g(α c δ ) + 3 log (1 g(α c δ ))

4 Note that the osterior robability is invariant under an arbitrary translation of all the arameters Our model does not choose any origin for the scale on which the arameters α c and δ are located We will discuss this later We can given an equivalent definition of this otimization roblem Suose we relace the rior on δ with ex ( (δ ) µ δ ) where µ δ is a constant This relaces the term (δ δ) with (δ µ δ ) We claim that a solution maximizing the new log osterior l 1 σδ (δ µ δ ) has δ = µ δ This is because given any solution, we can add a constant a to all the arameters without changing the likelihood, so we wish to set a to minimize (δ + a µ δ ) Take d da (δ + a µ δ ) = δ + a µ δ = 0 So the otimum is at a = µ δ δ So, a solution with δ µ δ cannot be a maximum, because we can increase the log osterior by shifting the arameters such that δ = µ δ Because of this, it is easy to see that maximizing l 1 σδ (δ µ δ ) is equivalent to maximizing l 1 σδ (δ δ), u to a translation of all the arameters So, we now wish to maximize F ( α, δ) = log g(α c δ ) + log (1 g(α c δ )) 1 (δ µ δ ) We take the artial derivatives and set them equal to 0 Note that d g(x) = g(x) (1 g(x)) dx For any cometitor c, taking the artial derivative with resect to α c yields 0 = F = (1 g(α c δ )) g(α c δ ) α c = r c g(α c δ ) where r c is the number of roblems that c answered correctly For any roblem, taking the artial derivative with resect to δ yields 0 = F = (1 g(α c δ )) + g(α c δ ) δ µ δ δ σδ = S + c g(α c δ ) δ µ δ σ δ where S is the number of contestants that answered correctly Note from the equation for α c that the score α c deends only on the number of roblems c solved correctly, given fixed difficulty arameters δ So all contestants that solve the same number of roblems have the same score, no matter which roblems they solved (In other words, r c is a sufficient statistic) 4

5 43 An estimation algorithm Next, we describe an iterative algorithm for finding the maximum (This is a hard EM algorithm, and can also be viewed as a coordinate ascent algorithm) Note that F α c deends only on the arameters α c and δ, and is monotonic in α c So given δ, we can solve the equation for α c by binary search Similarly, F δ deends only on δ and α, and is monotonic in δ, so we can solve its equation by binary search given α We can start by choosing an arbitrary α (0) Given a arameter vector α (n), we can set δ (n) as the solution to the equations for δ given α (n), and then set α (n+1) as the solution to the equations for α c given δ (n) Note that each ste is guaranteed to increase F 44 Concavity We now show that F is concave Because of this, we can see that our algorithm is guaranteed to converge to the unique global maximum Proosition 41 F is concave Proof We will first show that log g(α δ) and log (1 g(α δ)) are concave functions The Hessian of log g(α δ) is ( ) h(α δ) h(α δ) H = h(α δ) h(α δ) where h(α δ) = (1 + e α δ ) Note that h(x) 0 for all x For any z, ( ) 1 1 z T Hz = h(α δ)z T z = h(α δ) ( z z 1 z + z) = h(α δ)(z1 z ) 0 Therefore, H is negative semidefinite, and log g(α δ) is concave The Hessian of log (1 g(α δ)) is the same, so this function is also concave The terms of F associated with the riors are single-variable quadratics which are obviously concave Since F is a sum of concave functions, F is concave 45 Comuting scores We have described how we comute estimates of α c for each contestant (u to a translation of all the arameters, which we will discuss in the next section) Next, we consider a model test where the roblem difficulties are distributed as a Gaussian with mean 0 and standard deviation σ = 5: P (δ ) = 1 σ π ex eα δ ( δ σ We will comute the exected number of roblems a contestant with ability α c solves on this test This is just the robability of solving a roblem times the number of roblems, so from now on we will consider the model test to have one roblem The robability of solving a roblem distributed as P (δ) is g(α c δ)p (δ) dδ This value is the normalized score of cometitor c 5 )

6 Note that the normalized score is in [0, 1], with a normalized score of 0 when the cometitor solves no roblems, and a normalized score of 1 when the cometitor solves all the roblems 46 Setting the scale origin We now return to the issue of choosing a articular translation of the arameter estimates, because translating all the arameter estimates by a constant does not change the osterior robability in the scoring model We wish to set a scale origin for the arameters of each test in such a way that scores are comarable across tests There is no canonical way to do this, because we do not have a way to directly comare scores between tests One aroach is to set the scale origin by assuming that the roblem difficulties δ are distributed similarly on the tests Consider taking Gaussian riors on δ with mean µ δ = 0 These are the riors we used for the uroses of arameter estimation, which we found gave us an equivalent MAP estimate, u to translation by a constant This is equivalent to setting δ = 0 Note that this centers the distribution the same location as the model test s difficulty distribution Another aroach is to base our choice on an assumtion that the contestant scores are distributed similarly on the tests Consider taking an average of the contestant scores, and setting that equal to a fixed value In articular, we will translate the arameters such that the mean normalized score of the middle half of contestants (between the first and third quartiles) is equal to a constant µ s The aroach based on the mean roblem difficulty works reasonably well in general, yielding normalized scores that reflect the distribution of roblem difficulties on the test, unlike the raw scores The general and team tests have very different contestant oulations, and thus score distributions, comared to each other and to the subject tests Because of this, it does not make much sense to choose the scale origin for the general and team tests based on the contestant scores In contrast, for the subject tests, the assumtion that the contestant scores are similarly distributed is fairly accurate Using the contestant-based aroach yields scores that are more accurate across the subject than just setting δ = 0, because the distribution of roblem difficulties does differ significantly between subject tests We therefore set the mean of the middle half of contestants to µ s = 0 This value is chosen emirically based on ast results This method works to give scores on the subject tests that are fair and comarable between different subjects, even when they have different roblem difficulties The resulting arameter estimates tend to have δ close to 0, so they are still comarable to the general and team test scores 6

On split sample and randomized confidence intervals for binomial proportions

On split sample and randomized confidence intervals for binomial proportions On slit samle and randomized confidence intervals for binomial roortions Måns Thulin Deartment of Mathematics, Usala University arxiv:1402.6536v1 [stat.me] 26 Feb 2014 Abstract Slit samle methods have