Testbase Standardised Tests – Standard Setting

Information for schools about the standard setting process

To set the standards for our tests, we followed a widely used, industry-standard approach based on the Angoff Method (more details in the Appendix). We combined two reference points to set the ‘working at the expected standard’ threshold for each of our tests:

Expert judgement from a panel of primary curriculum experts and practising teachers as to which skills and knowledge would be expected to be secure at the point of assessment, i.e. a criterion-referenced standard.
The mark distribution for each test, taking into account the percentage of pupils who have achieved the expected standard in previous tests and trials, i.e. a norm-referenced standard.

These thresholds and the mark data were then used to calculate standardised scaled scores, in a similar style to the Key Stage 2 national curriculum tests. A scaled score of 100 indicates the threshold for ‘working at the expected standard for this point in the academic year’. This means that a pupil who achieves a scaled score of 100 is just at this threshold.

Interpreting scaled scores

Scaled score | Interpretation
None | If no scaled score is given, this is usually because a pupil scored very few marks and so we can’t reliably give a scaled score
< 80 | Scores indicate that pupils might not currently be working within year group expectations
80 – 99 | Scores within this band indicate that pupils could still need to secure knowledge and skills before we can be confident that they are meeting the expected standard
100 | Scores indicate that pupils could be on track to meet the expected standard for their academic year
101 – 110 | Scores within this band indicate that pupils are increasingly secure in their knowledge and understanding and are expected to meet the expected standard for their academic year
≥ 110 | A scaled score of 110 or higher may indicate that these pupils could be working at the ‘higher standard’; although this is the score used for the Y6 SATs, we recommend caution in making that judgement – see the discussion below.
≥ 120 | These pupils are performing at the ≥ 80th – 90th percentile (i.e. in the top ~10-20% of pupils; depending on the test). This is a much more robust indicator that pupils are working at the ‘higher standard’, and they will need activities to stretch and challenge if they are to continue to progress

Please note: The above applies to Key Stage 2 only. For Key Stage 1, there might be greater variation in the scaled scores due to the small number of marks on the papers combined with the fact that pupils of this age are still learning underlying skills.
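As an illustration only, the Key Stage 2 interpretation bands above could be applied to a list of scaled scores with a small lookup like the one sketched here. The function name and the shortened band descriptions are hypothetical, and the sketch assumes whole-number scaled scores. Because the published bands overlap at 110, it also assumes that a score of exactly 110 falls in the ‘≥ 110’ band.

```python
from typing import Optional


def interpret_scaled_score(score: Optional[int]) -> str:
    """Return a short label for a KS2 scaled score (hypothetical helper).

    Assumption: a score of exactly 110 is placed in the '>= 110' band,
    since the published '101 - 110' and '>= 110' bands overlap there.
    """
    if score is None:
        return "No scaled score: too few marks to report reliably"
    if score < 80:
        return "Might not currently be working within year group expectations"
    if score < 100:
        return "Could still need to secure knowledge and skills"
    if score == 100:
        return "Could be on track to meet the expected standard"
    if score < 110:
        return "Increasingly secure in knowledge and understanding"
    if score < 120:
        return "May be working at the 'higher standard' (use caution)"
    return "More robust indicator of working at the 'higher standard'"


# Illustrative scores only, not real pupil data.
for s in [None, 76, 95, 100, 104, 110, 123]:
    print(s, "->", interpret_scaled_score(s))
```

A lookup like this is only a starting point: the descriptions are deliberately tentative, and other evidence should be taken into account alongside them.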

Statistically, we would expect approximately 2 out of 3 pupils to have a scaled score between 85 and 115. However, unlike the SATs, we will report scaled scores beyond the 80 – 120 range to ensure teachers have the fullest information available to them. Our ‘floor’ and ‘ceiling’ are 70 and 130, though you should bear in mind that scaled scores at these extremes are less meaningful.
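The ‘2 out of 3’ figure can be checked with a short calculation. The sketch below assumes that scaled scores are roughly normally distributed with a mean of 100 and a standard deviation of 15. These values are our illustrative assumption, chosen because they reproduce the figure quoted above; they are not a published specification of how the scaled scores are constructed.

```python
from statistics import NormalDist

# Assumed, illustrative distribution parameters (not published values).
ASSUMED_MEAN = 100
ASSUMED_SD = 15


def share_between(low: float, high: float) -> float:
    """Share of pupils expected between two scaled scores under the assumed normal model."""
    dist = NormalDist(mu=ASSUMED_MEAN, sigma=ASSUMED_SD)
    return dist.cdf(high) - dist.cdf(low)


def share_at_or_above(score: float) -> float:
    """Share of pupils expected at or above a scaled score under the same model."""
    return 1 - NormalDist(mu=ASSUMED_MEAN, sigma=ASSUMED_SD).cdf(score)


print(f"Between 85 and 115: {share_between(85, 115):.1%}")
print(f"Between 70 and 130: {share_between(70, 130):.1%}")
print(f"At or above 110:    {share_at_or_above(110):.1%}")
print(f"At or above 120:    {share_at_or_above(120):.1%}")
```

Under this assumption, about 68% of pupils fall between 85 and 115, and about 95% fall between the floor of 70 and the ceiling of 130. Real tests will deviate from the model, so these figures are indicative only.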

Teachers can therefore use the scaled scores from tests throughout the academic year to help monitor pupil progress, i.e. they can see whether or not a pupil remains within the expected standard from test to test and if they are making expected progress over the year.

Scaled score conversion tables can be downloaded from the Testbase MERiT page.

Can we use a scaled score of 110 as the threshold for pupils working at the ‘higher standard’?

We have used a similar method to the KS2 national curriculum tests to create our scaled scores. The Year 6 SATs use 110 as the ‘higher standard’ threshold (which equates to around 85% of the available marks); on a normally performing test around 25-30% of pupils would achieve this scaled score.

At an individual TEST level, a score of 110 generally puts a pupil in around the top 25-30% (roughly 1 in 4 to 1 in 3), and a score of 120 is around the top 10%. At SUBJECT level, a score of 110 generally puts the pupil in the top 30-40% (roughly 1 in 3 to 2 in 5), whereas a score of 120 places them in the top 10-20% (1 in 10 to 1 in 5). However, this does vary depending on the test.

For consistency with the Y6 SATs, then, you could use a scaled score of 110 as a broad indicator that a pupil might be working at the ‘higher standard’, as long as other evidence is taken into account. However, our recommendation is to use 120+ as a more robust threshold for securely working at the ‘higher standard’. This normally includes only the top 10-20% of pupils (depending on the test), and to achieve it a pupil would need to correctly answer most of the more demanding questions on the tests.

Please note, this applies to KS2 only, as there is no clear score at which a child ‘exceeds’ the expected standard in KS1.

It is important to distinguish between the ‘higher standard’ and ‘greater depth’. A simple test score in isolation should never be used to make a ‘greater depth’ judgement. ‘Greater depth’ is a more complex set of skills that requires a truly holistic judgement based on broader evidence such as the pupil’s proactive engagement in learning, their depth and breadth of reasoning, etc.

Age-based standardised score

Unlike the scaled scores, age-based scores take into account the month of birth of pupils within their academic year and are based on our analyses of recent test data. They are intended to provide teachers with a clearer indicator of whether an individual pupil is achieving as expected for their age and, when compared over time, whether they are making the expected progress for their age.

Please note that because the age-based scores are derived from the performance of different age-groups in our test data, there are two important details to remember when referring to them:

They are not criterion-referenced, unlike the scaled scores. This means that an age-based standardised score of 100 should not be interpreted as having met any age-related threshold or criteria.

They are for comparison only, i.e. they should only be used in the context of the test to determine whether a pupil is performing well compared to other pupils of a similar age within their school year. An age-based standardised score of 100 therefore reflects that a pupil is performing as expected for the month of their birth within the academic year.

Pupils who are not within the expected age range for a given school year will not receive an age-based standardised score, for example pupils with delayed entry to school, or those in higher school years taking tests intended for a lower school year.

Monitoring Pupil Progress

Because the scaled scores are calculated the same way for each test, a pupil who is just reaching the expected standard at each testing point in the academic year will achieve a scaled score of 100, i.e. their scaled score will not increase across the academic year. This is because the thresholds reflect an expectation of pupils gaining new knowledge, skills, etc. from the teaching and practice they experience from testing point to testing point.

Please also remember that making expected progress is separate from reaching the expected standard. A pupil who has not attained a scaled score of 100 can still show expected progress from one test to the next if their scaled score remains similar across the academic year.

Variation of a few points up or down from one test to the next is to be expected. However, a large drop in performance (a substantially lower scaled score than in the previous test) suggests a pupil may not have made progress since the last test.

Equally, a substantial jump in scaled score from the previous test would indicate greater progress than might otherwise be expected.

The same principles apply to the age-based standardised scores too, i.e. if a pupil’s age-based score remains very similar at each testing point in the year for a particular subject, then they are progressing at the expected rate for their age.
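The comparison described in this section can be sketched as a simple check between two consecutive testing points. The function name is hypothetical, and so is the tolerance of 3 points: the guidance above says only that variation of ‘a few points’ is to be expected and does not define a numeric cut-off, so schools would need to choose their own value.

```python
def describe_progress(previous: int, current: int, tolerance: int = 3) -> str:
    """Compare two consecutive scaled (or age-based) scores for one pupil.

    The tolerance is a hypothetical stand-in for 'a few points'; it is
    not a published figure.
    """
    change = current - previous
    if change < -tolerance:
        return "drop: may not have made expected progress since the last test"
    if change > tolerance:
        return "jump: greater progress than might otherwise be expected"
    return "similar: progressing at the expected rate"


# Illustrative scores only. Note that a pupil below 100 can still
# show expected progress if their score stays similar.
print(describe_progress(94, 95))
print(describe_progress(102, 93))
print(describe_progress(98, 107))
```

Any flag raised by a check like this is a prompt to look at other evidence, not a judgement in itself.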

Please note: As these tests have been designed to judge whether pupils are meeting the expected standard, they might not test the full depth and breadth within your class. As such, pupils who are working below the expected standard, or are not achieving age-related expectations (ARE) might benefit from other types of assessment more suited to their needs. Pupils working at the higher standard would benefit from further, more challenging, formative assessments to test the depth of their understanding, and to ensure they continue to be engaged and make progress.

APPENDIX

Our Standard Setting Process

There were two panels for each subject, operating independently from each other. They were composed of experienced teachers, test developers and educational consultants, who had no involvement in setting the questions.

Panel members were individually asked to review each item on the tests (as well as review the test as a whole) and to make a judgement as to how many pupils (out of 100) who are working just at the expected standard at the point of assessment would get each question correct. This meant they had to take into account the point in each academic year that the tests are recommended to be used and what would be expected to have been taught at each point. The outcomes were used to create a criterion-referenced test threshold for each test.
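To illustrate how item-level judgements of this kind can be combined, here is a minimal sketch of a textbook Angoff-style calculation. It is not a description of the exact aggregation used for these tests: the function name and the ratings are invented for illustration, and weighting each item by its available marks is an assumption.

```python
from statistics import mean
from typing import List, Sequence


def angoff_threshold(ratings_by_item: Sequence[Sequence[float]],
                     marks_by_item: Sequence[int]) -> float:
    """Estimate a threshold mark from panel ratings (illustrative sketch).

    ratings_by_item[i] holds each panel member's estimate of how many
    pupils out of 100, working just at the expected standard, would
    answer item i correctly.
    Assumption: each item contributes (mean rating / 100) x available marks.
    """
    total = 0.0
    for ratings, marks in zip(ratings_by_item, marks_by_item):
        expected_share = mean(ratings) / 100  # average judgement as a proportion
        total += expected_share * marks
    return total


# Invented ratings from three panel members for a three-item test.
ratings: List[List[float]] = [
    [80, 70, 90],  # item 1, worth 1 mark
    [50, 60, 40],  # item 2, worth 1 mark
    [30, 20, 40],  # item 3, worth 2 marks
]
marks = [1, 1, 2]

print(f"Threshold: {angoff_threshold(ratings, marks):.1f} of {sum(marks)} marks")
```

In this invented example the threshold comes out at about 1.9 of the 4 available marks. In practice a figure like this would be rounded and then reviewed against pupil performance data before being finalised.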

An expert panel meeting was then organised and each test threshold was discussed and compared with the live, incoming pupil performance data from MERiT, as well as historical item performance from previous years (to give a norm-referenced viewpoint). The final test thresholds were then agreed amongst panel members based on both expert judgement and analysis of pupil test performance.

This process is similar to the one used by the Standards and Testing Agency (STA).

Outcomes

Testbase considered the recommendations of the two panels before setting the final standards.