Testing is widely used in education for many reasons, ranging from program eligibility and identification of children for special education to intermittent assessment of progress in an educational program. Universities use standardized tests for admission to undergraduate and graduate programs. Many schools, programs, and private companies may need to produce evidence of progress, to make programmatic or administrative decisions, or both. Ideally, assessment should be used in a formative way to determine how to help learners achieve objectives and goals on an incremental, daily basis. However, tests are frequently used only in a summative way to evaluate what happened.
The key to understanding the differences among learners is to recognize that variation is a rule of nature. The child's body and brain are affected by inheritance, innate differences, and the natural environment. Although there are similarities in the developmental stages of children, there is wide variation among children caused by many factors. Such variation caused Arnold Gesell (1943), the eminent professor and student of child development, to say:
Norms of behavior development, as measures of maturity, must be applied with even greater caution. The lay person should not attempt to make a diagnosis on the basis of such norms. This would constitute a misuse of norms. Refined and responsible application of maturity norms requires clinical skill based on long clinical experience (p. 70).
The terms assessment and evaluation are used interchangeably. Both are intended to mean that a process of information collection is used to make decisions about students; evaluation is sometimes used more narrowly, meaning to make a particular decision or diagnosis. Assessment and evaluation are used synonymously in this text because both refer to making decisions about children based on the general process of collecting information.
Purposes of Assessment
Assessment may involve the collection of information about children from a great many areas. Possible areas of assessment might include the following (obviously some would not apply directly to infants and preschool children, such as academic functioning):
- Educational functioning
  - Readiness
  - Achievement
  - Learning style
- Social-emotional functioning
  - Social psychological development
  - Self-help skills
- Physical functioning
  - Visual
  - Auditory
  - Speech and language
  - Health
- Cognitive functioning
  - Intelligence
  - Adaptive behavior
  - Thinking processes
- Language functioning
  - Receptive
  - Expressive
  - Nonverbal
  - Speech
- Family
  - Dominant language
  - Parent-child interactions
- Environment
  - Home
  - School
  - Interpersonal
Assessment may be conducted in the areas above for the following specific purposes:
Screening
Determination of current abilities or needs
Development of a comprehensive program or treatment (medical, educational, therapeutic)
Screening
Screening, a term derived from sifting, is a process of sorting groups or individuals quickly for purposes of determining if further assessment is needed. For young children, screening may involve a variety of assessments to determine if there is any reason to suspect a health condition, acuity problem, or any problem with sensorimotor functioning. For older children and adults it may also mean determining whether any academic difficulties exist. Screening is a process used to identify a wide range of problems. Physicians and school personnel may actively attempt to identify children with hearing, vision, and speech problems. Screening may involve teachers' or parents' opinions aided by checklists, rating devices, interviews, inventories, records review, and observations.
Screening should be a process to find children who may have problems. If screening and assessment practices are sound, the number of pupils identified by screening should be relatively small, and a majority of students will probably not have problems. If a child appears to have a visual, auditory, or motor problem, on the basis of some screening instrument, the child would be referred for more extensive testing. Screening may cause a child to be referred for further assessment to rule out a problem or to prevent a problem from developing.
Determination of current abilities or needs
Tests may be used to establish a baseline of information or to monitor a child's progress over a period of time. An initial assessment for an infant, preschool child, or child in school may be used simply to establish how the child performs in a variety of areas (current abilities or needs) in order to build a program of training, education, or treatment, depending upon the purpose. For many young children who are developmentally normal, this may be nothing more than a confirmation of normal development and placement in an age-appropriate program or class. For handicapped children, this may entail many other assessments leading to specific decisions about classification, program placement, and development of an appropriate program.
Development of a comprehensive program
For any child, and particularly those with disabilities, assessment may lead to some kind of comprehensive program---medical, educational, or therapeutic---to prevent, improve, or alleviate a condition identified by assessment personnel. In the case of disabilities this requires a very specific series of actions leading to a diagnosis, placement, and development of an individual education plan (IEP) or individual family service plan (IFSP), which will be described below.
Apart from achievement tests, most children in K-12 are unlikely to undergo many other evaluations. After college admission, assessment is uncommon except for course evaluations.
Program Evaluation and Pupil Progress
Test data are used to measure the impact and effectiveness of programs by examining the gain scores of subjects on various measures over the course of time. Head Start programs have been evaluated with various kinds of test data, but particularly IQ scores and achievement test data. Authorities may examine such scores to make judgments about the overall effectiveness of the program, based on the collective average gains of the students in the program. Many school systems are judged by the SAT, ACT, or other measures of academic achievement. Many computer applications are included in evaluations that compare students' gain scores with those of students who take a comparable program without computer support.
Statistical Properties of Tests
Some of the most important properties of standardized tests are the statistical characteristics reported in the test manuals, particularly the measures of central tendency, reliability, and validity---and these are often misunderstood or ignored by teachers. There are three measures of central tendency to describe group performance as a whole:
mean - The average, $\bar{X}$.
median - The middle raw data point.
mode - The most frequent (f) score; there can be more than one mode.
1. Mean: add all scores and divide by n. For the scores 69, 69, 69, 68, 67, 67, 66, 65, 65, 65, 65, 64, 63, 63, 61, 61, 61, 61, 60, 60:
$$\bar{X} = \frac{\sum X}{n} = \frac{1289}{20} = 64.45$$
2. Median: find the position of the median, then count up from the bottom of the ordered scores to that position.
$$\text{position of Mdn} = \frac{n + 1}{2} = \frac{20 + 1}{2} = 10.5$$
Position 10.5 falls midway between the 10th and 11th ordered scores; both are 65, so the median is 65.
3. Mode: the most frequently occurring score(s). This is a bimodal distribution: Mo = 61 and 65 (each occurs four times).
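The three measures of central tendency can be checked with a short Python sketch using the standard `statistics` module. The list below assumes twenty scores; the second 67 is an assumption inferred from the stated sum (1289) and n (20), since it is needed to reach that total.

```python
from statistics import mean, median, multimode

# Twenty scores. The second 67 is an assumption inferred from the
# stated sum (1289) and n (20).
scores = [69, 69, 69, 68, 67, 67, 66, 65, 65, 65,
          65, 64, 63, 63, 61, 61, 61, 61, 60, 60]

n = len(scores)                     # number of scores
total = sum(scores)                 # sum of X
average = mean(scores)              # sum of X divided by n
middle = median(scores)             # mean of the 10th and 11th ordered scores
modes = sorted(multimode(scores))   # every score tied for most frequent

print("n =", n, "sum =", total)
print("mean =", average)
print("median =", middle)
print("modes =", modes)
```

`multimode` is used rather than `mode` because it returns every tied score, which is what a bimodal distribution requires.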
4. Percentile Ranking
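A score on a test or other measurement means nothing unless it is related to other scores (relative standing). A 65 on a test is meaningful only when we know how it compares. One way to do this is to use the percentile rank or rank order based on a scale of 100. The percentile rank is the percentage of scores below a particular score, counting half of the scores tied with it. The formula is
$$PR = \frac{100}{N}\left(cf - \frac{f}{2}\right)$$
where N is the number of scores, f is the frequency of the score in question, and cf is the cumulative frequency up to and including that score.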
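The percentile rank formula can be sketched in Python as follows. This is a minimal illustration, assuming that cf counts every score at or below the score of interest and that f is the frequency of that score. The function name `percentile_rank` is hypothetical, and the score list again assumes a second 67.

```python
def percentile_rank(scores, score):
    """Percentile rank using PR = 100 * (cf - f/2) / N."""
    n = len(scores)
    f = scores.count(score)                     # frequency of this score
    cf = sum(1 for s in scores if s <= score)   # cumulative frequency
    return 100 * (cf - f / 2) / n

# Same twenty scores as in the central tendency example
# (the second 67 is an assumption).
scores = [69, 69, 69, 68, 67, 67, 66, 65, 65, 65,
          65, 64, 63, 63, 61, 61, 61, 61, 60, 60]

for s in (60, 65, 69):
    print(s, percentile_rank(scores, s))
```

For a score of 65, cf is 13 and f is 4, so the percentile rank is 100 × (13 − 2) / 20 = 55.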
Measures of variability
Research and more scientific approaches to describing group performance use measures of variability in addition to measures of central tendency. There are several measures of variability, but the most important are:
Standard Deviation. This is a measure of the variability in a distribution of scores. The standard deviation is a unit of measurement that expresses how far a score stands from the average or mean. When scores are normally distributed, it tells the examiner the percentage of scores falling between points on a test's distribution of scores: about 68 percent fall within one standard deviation of the mean and about 95 percent within two.
Standard error. Because any score on a test is only an estimate of the true score, the standard error shows the "band of confidence" within which a subject's score may be expected to fall. The greater the standard error, the greater the variation in the estimated score and the less confidence there is in the score.
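A brief Python sketch may help connect the two measures. It computes the sample standard deviation of the example scores and then a "band of confidence" around one observed score. The formula SEM = SD × sqrt(1 − reliability) is the conventional one from classical test theory and is not stated in the text. The reliability value of .90 is a hypothetical input, and the score list assumes a second 67.

```python
from math import sqrt
from statistics import stdev

# Same twenty scores as in the central tendency example
# (the second 67 is an assumption).
scores = [69, 69, 69, 68, 67, 67, 66, 65, 65, 65,
          65, 64, 63, 63, 61, 61, 61, 61, 60, 60]

sd = stdev(scores)  # sample standard deviation (divides by n - 1)

def standard_error_of_measurement(sd, reliability):
    """Conventional classical-test-theory formula (an assumption here)."""
    return sd * sqrt(1 - reliability)

reliability = 0.90          # hypothetical reliability coefficient
sem = standard_error_of_measurement(sd, reliability)

observed = 65
band = (observed - sem, observed + sem)   # one-SEM band around the score

print("SD =", round(sd, 2))
print("SEM =", round(sem, 2))
print("band =", tuple(round(b, 2) for b in band))
```

A lower reliability coefficient widens the band, which matches the point that a larger standard error means less confidence in the score.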
Reliability. The reliability of a test is the extent to which it will provide the same results (similar scores) on repeated administrations, or the ratio of true variance to obtained variance. If a student has a certain score one day and a much different score on the same test a few days later, the test is not very useful because it is unreliable. Reliability coefficients of .60 are appropriate for screening purposes. Tests are usually considered highly reliable if they reach a coefficient of .90. Of course, there are other types of reliability, such as equivalence, stability, and internal consistency, which should be examined. A standard reliability measure used in survey instruments is Cronbach's alpha, which can be easily computed in SPSS.
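For readers without SPSS, Cronbach's alpha can also be computed directly. The sketch below uses the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. The function name and the response data are hypothetical and serve only as illustration.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """responses: one list of k item scores per respondent."""
    k = len(responses[0])
    items = list(zip(*responses))                    # regroup scores by item
    item_variance_sum = sum(pvariance(col) for col in items)
    total_variance = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)

# Hypothetical ratings: five respondents, three items each.
responses = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 3],
]

alpha = cronbach_alpha(responses)
print("alpha =", round(alpha, 3))
```

Population variances are used throughout; using sample variances for both the items and the totals gives the same alpha, because the correction factor cancels in the ratio.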
Validity. The validity of a test is the degree to which it actually measures what it is said by its developers to measure. Just because a test maker says a test is valid does not mean it is. Validity is difficult to determine in education and psychology because we are not always sure that a trait really exists or the instrument we use is actually measuring the presumed trait. This is characteristic of personality traits, and also for many of the "processes" presumed to underlie learning.
Types of validity are content, criterion-related (concurrent and predictive), and construct. Content validity concerns objectives in the domain (e.g. reading objectives). Concurrent validity refers to a correlation between a given test and another test administered at the same time. Some examiners and test developers correlate two tests and presume they measure the same thing. Predictive validity refers to the ability of a test score to predict performance in the future. Predictive validity is difficult to achieve, but this does not mean that tests will not be used with this presumption. For example, SAT scores have been shown to add virtually nothing in predicting college performance based on GPA, but they are still used, regardless of the expense.
Construct validity is the method used to determine that a test measures more subjective or theoretical conditions (constructs) such as personality traits, e.g. moodiness, or processes in learning disabilities.
Methods of Determining Validity
Content validity---Items have content validity if they ask students to demonstrate those skills and competencies required by the objectives. This is usually determined by expert opinion or systematic sampling of items based on objectives.
Criterion-related validity---the correlation of measurements with an external criterion (using a correlation; numerical value). Refers to predictive and concurrent validity.
Predictive---the ability of a test to "predict" something in the future.
Concurrent---the ability of a test to predict or identify subjects immediately based on some classification or trait.
Construct validity---determining the validity of a construct is important when the test user wishes to infer the degree to which the individual possesses some hypothetical trait or quality (construct) presumed to be reflected in the test performance (e.g. neuroticism, creativity, anxiety, etc.).
Predictor tests are constructed to sample the skills, attributes, or traits required by the criterion. If the coefficient is low, there is little relationship between the two. Predictive validity coefficients vary considerably, but correlations of .60 or .70 are considered to be high. A .90 would be excellent.
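A validity coefficient of this kind is usually a Pearson correlation between predictor scores and criterion scores. The following Python sketch shows the calculation. The function name `pearson_r` and both data lists are hypothetical and are not drawn from any real test.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical predictor test scores and a later criterion (e.g. GPA).
predictor = [50, 55, 60, 65, 70]
criterion = [2.1, 2.8, 2.5, 3.4, 3.2]

r = pearson_r(predictor, criterion)
print("validity coefficient r =", round(r, 2))
```

The coefficient ranges from −1 to +1; values near zero indicate little relationship between the predictor and the criterion.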
Norms. The norms of a test are the scores reported for the people who were originally tested to standardize the test. If a person who takes a test has a background and experiences that are much different from those of the persons in the normative sample, the test may be unfair. The issue of norms is a basic concern in the testing of minority students, because they are compared with norms that reflect different experiences, opportunities, and privileges.
Types of Scores
Percentile
The percentile is one of the most common and easily interpreted types of test scores. The percentile is the rank of an individual score in a distribution of scores for 100 subjects. A score at the 50th percentile means that the performance of the subject is equal to or greater than that of 50 out of 100 people; a score at the 85th percentile means that the performance is equal to or greater than that of 85 people out of 100. It is important to recognize that a percentile score of 50 is not inferior.
Developmental Scores
Developmental scores are easily understood also because they can be intuitively interpreted in terms of chronological age, mental age, or grade equivalents. The grade equivalent is very common in school reports but chronological age and mental age may be used with younger children and infants. It is important to remember that there are wide differences in developmental scores for young children because of reliability problems of instruments. These scores are of limited usefulness because they do not relate to equal intervals, as in the case of standard scores. They also vary greatly from test to test, depending on the learning experiences of the sample used to develop norms for a test. Thus, it is not uncommon to find highly disparate scores for the same subject in a series of similar tests.
Standard Scores
Deviation IQ scores on individual intelligence tests, Z-scores, T-scores, and stanine scores are examples of standard scores. Such scores are related to percentiles and can be interpreted in the same way that standard deviations are interpreted.
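The relationship among these standard scores can be illustrated with a short Python sketch. The conversions below follow common conventions that are assumptions here rather than statements from the text: T-scores with mean 50 and SD 10, deviation IQ with mean 100 and SD 15, stanines approximated as 2z + 5 rounded and limited to 1 through 9, and percentiles taken from the normal curve. All function names are hypothetical.

```python
from statistics import NormalDist

def z_score(raw, mean, sd):
    """Distance of a raw score from the mean in standard deviation units."""
    return (raw - mean) / sd

def t_score(z):
    return 50 + 10 * z                 # assumed convention: mean 50, SD 10

def deviation_iq(z, sd=15):
    return 100 + sd * z                # assumed convention: mean 100, SD 15

def stanine(z):
    # Common approximation: 2z + 5, rounded and limited to 1..9.
    return max(1, min(9, round(2 * z + 5)))

def percentile_from_z(z):
    # Percentage of a normal distribution falling below z.
    return 100 * NormalDist().cdf(z)

z = z_score(raw=115, mean=100, sd=15)  # one SD above the mean
print("z =", z)
print("T =", t_score(z))
print("IQ =", deviation_iq(z))
print("stanine =", stanine(z))
print("percentile =", round(percentile_from_z(z), 1))
```

A score one standard deviation above the mean falls at roughly the 84th percentile of a normal distribution, which is the sense in which standard scores relate to percentiles.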
Types of Tests
Many types of tests exist, which can be classified into one or more of the following categories: individual tests, group tests, objective, subjective, power (no time limit), speed, verbal, nonverbal, non-language, criterion-referenced, norm-referenced, and so forth.
Group and Individual Tests
Tests may also be classified as group or individual. A group test is used with an entire class or other large group. An individual test is designed for use with one student at a time. For the most part, because of the precise information that can be obtained and the depth of the examination, only individual tests are used in assessment for the purpose of deciding on the placement of handicapped students. In group tests, the students must be motivated and must usually be able to read and follow directions without assistance. Some professional organizations, like the Council for Exceptional Children and the American Psychological Association, have taken the position that group tests should not be used for any reason other than screening.
Informal Assessment
The teacher is typically in the best position to provide information about students, spending many hours with them. In cases of special education, informal assessment information and samples of student work may be included in the evaluation. But informal assessment can be a useful tool for the classroom teacher as a method of formative assessment for use in daily instruction.
Checklists and Rating Scales. There are two popular forms of "informal" assessment procedures commonly employed in schools: the checklist and the rating scale. Some companies produce such instruments, which may also have normative data. Checklists and rating scales may be developed at the local level by someone in the school or may be developed at the district level. Teachers may also develop their own instruments or use those provided to them. A checklist is simply an indication of the presence or absence of some behavior. A rating scale requires some kind of measurement of the presence of a behavior and its degree or frequency. Whoever uses the instrument must provide some judgment about the behaviors of the student, so the system should be as fair and valid as possible.
To the extent that it is possible, the development of a checklist or rating scale should be similar to the development of any test. A fair test in a classroom would not include skills a child has not had a chance to learn nor skills that are not relevant to the classroom. So in development of a checklist or rating scale, the behavior checked or rated should have educational relevance, and it should be something that can be observed rather than inferred. In the case of a rating instrument, the behaviors should be rated across a range to allow for discriminations or distinctions.
In a rating scale, descriptors or numerical ratings can be used. Numerical ratings should have a spread of 3, 5, or 7 points (i.e., 1 2 3; 1 2 3 4 5; 1 2 3 4 5 6 7). In a numerical system, there should be a decision made at the outset about what is high or low (e.g. 1 is high, or 7 is high), and this should be consistent. For example, a rating might be used for student behavior:
Sad 1 2 3 4 5 6 7 Happy
A descriptive scale may include a continuum using words that describe the behavior in more detail, such as:
___ Depressed most of the time and cries a lot.
___ Frequently depressed and cries occasionally.
___ Seldom depressed or unhappy.
___ Usually happy.
___ Happy and pleasant almost all the time.
Other forms are as follows:
Motivation: A (Excellent)   B (Good)   C (Fair)   D (Poor)   E (Deficient)
Mood: Happy / Typically Happy / Sometimes Sad / Typically Unhappy / Depressed
Observation. Observation of children or adults with a systematic approach to record instances of behavior can give an accurate description of behavior or performance within different contexts and environments. This is a useful kind of information for making important educational decisions. In fact, this is also referred to as ecological assessment, the process of collecting data and making decisions about behavior in a natural environment instead of a controlled, experimental, or test situation. There are two general approaches: continuous recording and sampling.
The best observational approaches are those designed around valid traits of observation and interrater reliability or agreement. Interrater reliability refers to the ability of two or more observers to agree on what they observe. This depends on the ability to define the behaviors that will be rated and on a uniform method of rating and recording information. Traits such as those mentioned above (mood, happiness, motivation) are extremely difficult to define. It is much easier to count the frequency of certain behaviors than to infer from observation the "internal state" of a child. Some children who seem to be unhappy may, in fact, be quite well adjusted. Some children who seem to be unmotivated may, nonetheless, be quite productive in their work. Such vague descriptions as mood or motivation lead to extremely subjective evaluations that can be contaminated by the opinions of the observers and may be a better index of what observers think they see than anything real about the children observed.
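Interrater agreement can be quantified in a few lines of Python. The sketch below computes simple percent agreement and also Cohen's kappa, a widely used statistic that corrects for chance agreement. Kappa is not named in the text and is offered here only as one common option. The two observers' records are hypothetical.

```python
from collections import Counter

def percent_agreement(rater_a, rater_b):
    """Percentage of observations on which two raters gave the same code."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100 * matches / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical records: was the child on task in each of ten intervals?
observer_a = ["on", "on", "off", "on", "off", "off", "on", "on", "off", "on"]
observer_b = ["on", "on", "off", "off", "off", "off", "on", "on", "on", "on"]

print("percent agreement =", percent_agreement(observer_a, observer_b))
print("kappa =", round(cohens_kappa(observer_a, observer_b), 2))
```

Kappa is lower than raw agreement because two observers who both code "on" most of the time will agree fairly often by chance alone.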
This leads back to the original point that observation is a better tool than a checklist or rating device, and the best observations are based on well developed instruments that require interrater reliability. Techniques that may be used in observation include:
Frequency Recording. This is a method of simply recording occurrences of certain behaviors over a period of time. Examples might include getting out of the seat, seeking assistance, bothering others, and so forth.
Duration Recording. Another method relates to behaviors that may not occur frequently but which may last for a long period of time. For example, a child who becomes very upset and will swear, destroy things, or have a "tantrum" may not do this frequently, but may do it occasionally for a certain period of time. In such instances, the frequency and the duration of the events would be important.
Time Sampling. An observer can watch children for certain periods of time, either chosen at random in advance or selected on the basis of some intuitive approach. For example, the observer may randomly select 15- to 30-minute periods during which behaviors will be recorded.
Interval Recording. This is an alternative to other methods. Using a time period, behavior is recorded according to some interval, such as 10-second recordings. Interval recording has been found as useful as continuous recordings or assessment of videotapes.
Regardless of the system used, if the teacher is the recorder, the system must be efficient enough that it does not interfere with the teacher's teaching responsibilities. Some systems of observation are so complicated that teachers do not use them.
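The recording methods described above can be sketched in Python. The example assumes a hypothetical log of behavior episodes, each stored as a (start, end) pair in seconds from the start of the session. The interval function marks an interval whenever the behavior occurs at any point within it, which is one common convention and an assumption here.

```python
def frequency(episodes):
    """Frequency recording: how many times the behavior occurred."""
    return len(episodes)

def total_duration(episodes):
    """Duration recording: total seconds the behavior lasted."""
    return sum(end - start for start, end in episodes)

def interval_record(episodes, session_length, interval=10):
    """Interval recording: one True/False mark per interval.

    An interval is marked True if the behavior occurred at any
    point during it (an assumed convention).
    """
    marks = []
    for lo in range(0, session_length, interval):
        hi = lo + interval
        occurred = any(start < hi and end > lo for start, end in episodes)
        marks.append(occurred)
    return marks

# Hypothetical 60-second session with three out-of-seat episodes.
episodes = [(3, 8), (25, 42), (55, 57)]

marks = interval_record(episodes, session_length=60, interval=10)
print("frequency =", frequency(episodes))
print("duration (s) =", total_duration(episodes))
print("intervals marked =", marks)
print("percent of intervals =", 100 * sum(marks) / len(marks))
```

The same log yields three different summaries, which illustrates why the choice of method depends on whether the frequency or the duration of a behavior matters more.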
During the 1960s and 1970s, behavioral psychology was extremely influential in education, particularly in special education. This system usually required the collection of baseline data over a period of time, sometimes lasting one or two weeks, to establish the typical behavior of a pupil. After the baseline data were collected, the teacher would decide what intervention strategy to use, such as rewarding certain behaviors, ignoring certain behaviors, or using time out, and would then collect data for a period of time during which the "treatment" or intervention was in use. Comparison of the child's baseline behavior with intervention data, shown on a graph or chart, was the basis for making determinations about the effectiveness of the treatment.
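The baseline and intervention comparison can be summarized numerically as well as on a chart. The Python sketch below is a minimal illustration. The daily counts are hypothetical, and comparing phase means is only one simple way of judging the effect of a treatment.

```python
from statistics import mean

def compare_phases(baseline, intervention):
    """Summarize the change in a behavior from baseline to intervention."""
    base_mean = mean(baseline)
    treat_mean = mean(intervention)
    change = treat_mean - base_mean
    return {
        "baseline_mean": base_mean,
        "intervention_mean": treat_mean,
        "change": change,
        "percent_change": 100 * change / base_mean,
    }

# Hypothetical daily counts of out-of-seat behavior.
baseline = [8, 7, 9, 8, 10]        # one week before the intervention
intervention = [6, 5, 4, 4, 3]     # one week with the intervention in use

summary = compare_phases(baseline, intervention)
for name, value in summary.items():
    print(name, "=", round(value, 1))
```

A negative change indicates that the behavior occurred less often during the intervention than during the baseline period.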
Many schools are using student-centered models of instruction guided by such educational philosophies as "cooperative learning," "reflective teaching," "multiple intelligences," and "constructivism." A student-centered curriculum is based on the principle that learners must be engaged in appropriate instructional activities for growth, development, and learning, and such activities must be matched to the needs of each learner and---to the maximum extent possible---directed by the learner. When the focus is on the learner and not a uniform curriculum, educators must use a wide variety of instructional approaches individualized for each student. By necessity this requires a different orientation to student assessment.
Increasingly, performance-based assessments have been used to show what children have learned and to encourage higher levels of achievement, although many groups are critical of this movement and put pressure on school boards and state legislatures to prevent it. With this emerging view, the purpose of assessment is regarded as a way to provide teachers with information to improve learning, not to use test results to award grades or compare students. For performance-based assessment to be useful for improved learning, it must be valid and reliable. A good assessment program can help show strengths and weaknesses, motivate students, and provide useful information for students, teachers, and administrators. A learner-centered approach actively involves the learner in all aspects from planning to evaluation.
The terms assessment, testing, measurement, and evaluation are used almost synonymously, but each actually has a unique meaning. A test is an instrument or a procedure to determine how well a student performs in contrast with others or with performance tasks. Measurement is the degree of a trait or skill possessed by a student, according to some scoring system. Evaluation is the process of collecting and interpreting information about how well a student meets instructional objectives of the curriculum. Assessment is often used interchangeably with evaluation by many writers, although assessment can also incorporate information about the learning environment, the teacher, the home, and other factors in addition to the learner's individual characteristics.
Testing is widely used in education for many reasons, ranging from program eligibility and identification of children for special education to intermittent assessment of progress in an educational program. There may be a need to produce evidence of progress and such data may be used to make programmatic or administrative decisions. Ideally, assessment should be used in a formative way to determine how to help a child achieve objectives and goals on an incremental, daily basis. However, tests have frequently been used only in a summative way to evaluate what happened, often after it is too late to intervene.
There are generally two kinds of data used in educational assessment or evaluation, quantitative and qualitative. A quantitative measurement uses values from an instrument based on a standardized system that intentionally limits data collection to a selected or predetermined set of possible responses. Qualitative measurement is more concerned with detailed descriptions of situations or performance, hence it can be much more subjective but can also be much more valuable in the hands of an experienced teacher. Performance-based assessment in most schools is based on qualitative evaluations of student performances.
What is performance-based assessment?
In performance-based assessment the examinee performs some task that requires an in-depth understanding of a skill rather than just reciting knowledge or recalling facts. The student performs a task instead of choosing an answer from a list, such as explaining historical events, postulating scientific hypotheses, solving math problems, talking in a foreign language, or carrying out a research project. Teachers judge the quality of the student's work according to agreed-upon criteria.
Traditional testing programs have been concerned with comparisons of individual students, teachers, schools, and school districts on the basis of scores. Performance assessment requires students to construct responses to problems rather than react to pre-selected items such as multiple choice, true-false, and matching. The student "performs" in some way to demonstrate competency in skill or knowledge, which can be judged or evaluated in terms of acceptable criteria. Technically, the term "performance" refers to the kind of student response that is examined, and "assessment" refers to how the performance is judged according to some set of criteria (rubrics), commonly based on a point system associated with criteria that describe the performance necessary for points. Examples of student work that depict criteria are sometimes called anchors.
Assessment can be based on oral questioning, interviews, observation, and evaluation of products and performances. Tasks used in performance-based assessment include essays, oral presentations, open-ended problems, hands-on problems, real-world simulations and other authentic tasks. Such tasks are concerned with problem solving and understanding. Just like standardized achievement tests, some performance-based assessments also have norms, but the approach and philosophy are much different than traditional standardized tests. The underlying concept is that the teacher and the child should produce evidence of accomplishment of curriculum goals which can be maintained for later use as a collection of evidence to demonstrate the child's achievements, and perhaps also the teacher's efforts to educate the child.
Performance-based assessment is sometimes characterized as assessing real life, with students assuming responsibility for self-evaluation. Testing is "done" to a child, while performance assessment is done by the child as a form of self-reflection and self-assessment. The overriding philosophy of performance-based assessment is that teachers should have access to information that can provide ways to improve achievement, demonstrate exactly what a student does or does not understand, relate learning experiences to instruction, and combine assessment with teaching. The hope is that programming can be more flexible and related to individual needs and differences. Evaluation can focus on what students can eventually learn, rather than on intermediate test scores, and both the teacher and student can be directly involved in selecting assessment activities.
Different names for the same thing. Vermont pioneered performance-based assessment on a statewide basis, and there are active efforts in California, Illinois, and Virginia, among other states, and many local districts and some universities have promoted performance assessment. This leads to different terminology and approaches, although the general procedures are similar. Perhaps one of the greatest single influences on performance-assessment methods and terminology has been the work of Wiggins (1993), whose writings have become the primary source for training and inspiration. Although we have decided to use the term performance-based assessment in this book, many terms are used to describe the same thing, terms that are often used interchangeably and can be confusing. In some cases there are subtle differences among them and in others there are very different meanings, but they all refer to the use of assessment techniques that focus on the developmental progress of students.
Other terms found in the literature are alternative assessment, authentic assessment, direct assessment, and portfolio assessment.
Why is performance-based assessment so popular?
Reasons for the popularity of this trend include the many changes in U.S. education in response to reform efforts and changing educational philosophy, moving away from direct instruction. Collins (1991) has documented characteristics of schools that are changing from traditional patterns of instruction:
- There has been a shift from whole-class to small-group instruction.
- There has been a shift from lecture to coaching.
- There is increased time on task and engagement.
- There is a shift from summative tests to performance assessment.
- There is a shift to cooperative learning.
Due to such shifts in education, new methods of assessment are also needed, which explains the popularity of performance assessment. Assessment is tied to instruction, so traditional standardized tests are no longer sufficient to provide the information necessary for newer forms of teaching and learning. Performance-based assessment uses continuous and frequent assessment to guide students and teachers, so that assessment data are used to make changes and improvements. Summative assessments are more robust and comprehensive with this method. Performance-based assessment is also more sensitive to small-group and individualized learning and to teaching methods that depart from whole-class and lecture methods.
What is the philosophy of performance-based assessment?
Performance-based assessment is founded on certain philosophical beliefs and practical considerations determined by the changing context of education:
Many teachers are interested in the learning styles of children, and there is great promise in using performance assessment to show how children respond to various learning tasks and environments.
Self-assessment. One of the goals of constructivist classrooms is to stimulate an interest in learning and self-assessment, something difficult to achieve with traditional grading and instruction.
Improvement of self-esteem. Children can receive feedback from samples of their own work that show change and improvement.
Identification of gifted performance/abilities. Rather than use intelligence tests that are highly correlated with socioeconomic status, actual performance of students in a variety of areas can be evidence of talent.
Determination of eligibility for programs. Similar to identifying talented children, evidence revealed through performance may be much more useful than a standard score in showing the needs of some students for special education. However, even disabled children may also have talents.
Readiness for advanced placement. Children who clearly excel in certain academic areas can demonstrate their abilities through carefully selected projects and classroom products.
Explanations to patrons. Test scores and letter grades are not as effective as the actual work of students in communicating achievement to parents and the public.
Is standardized testing really bad?
Many advocates believe that performance assessments can help reform American education by changing teaching and learning from a passive to a more interactive process, by giving teachers more authority and thus empowering them, and by motivating students to aspire to achieve.
Advocates of performance assessment often criticize standardized achievement tests, sometimes unfairly. Standardized testing has been used in American schools since the 1930s. By the 1980s, literally millions of standardized tests were administered to children for many purposes, including admission to special education, promotion, graduation, competition for scholarships, and so forth.
Critics have attacked standardized tests for their gatekeeping functions. Many interest groups are concerned about exclusion and tracking related to ethnic, gender, or socioeconomic status caused by test scores. Other critics believe that the pressures placed on children and teachers to show gains detract from improving teaching and learning because of drilling for tests.
Achievement tests do not measure all the skills, interests, and abilities of a child, nor were they ever intended to. What they assess is very restricted and may not always match the local school curriculum. Direct assessment, as a part of the natural events in the classroom, relates assessment to the child and can be used as a means to connect assessment with teaching on a daily basis.
The positive aspects of norm-referenced tests include an objective scoring format, standardized procedures for presentation, and norms for comparison of a person's score with others representing the same age and other characteristics. The test is designed so that a representative sample of subjects across certain ages and grades will be included. Achievement on the test is determined by how the subject compares with those who took the original test, which may be expressed in terms of a standard score, percentile, or some other score. This permits global interpretation of performance but provides very little other useful information; most test developers do not even claim that their tests can provide much more than a score.
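As a rough illustration of how a norm-referenced score is derived, the sketch below converts a raw score into a standard (z) score and a percentile rank against a norm group. The norm-group scores and the function names are hypothetical, invented only for this example; published tests rely on their own norming samples and score scales.

```python
from statistics import mean, pstdev

def standard_score(raw, norm_scores):
    """z-score: distance of the raw score from the norm-group mean, in SD units."""
    return (raw - mean(norm_scores)) / pstdev(norm_scores)

def percentile_rank(raw, norm_scores):
    """Percentage of the norm group scoring below the raw score."""
    below = sum(1 for s in norm_scores if s < raw)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm group of ten raw scores (illustrative only).
norm_group = [10, 12, 14, 15, 15, 16, 18, 20, 22, 28]

z = standard_score(20, norm_group)
pr = percentile_rank(20, norm_group)
print(f"z = {z:.2f}, percentile rank = {pr:.0f}")
```

Both numbers describe only where the student stands relative to the norm group; neither says which items the student answered correctly or what the student can do.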
A typical norm-referenced test does not reveal what a student can or cannot do, or really even what a student has learned. What it really does is permit a comparison or relative standing of the student in relationship to the subjects who established the norms. If the norms are scores of persons who differ greatly from the person who takes the test, or if test items assess knowledge or skills the student has not been exposed to, the test will be unfair. Relevant educational experiences may be linked through performance-based assessment in ways that are just not possible with multiple-choice tests, but most standardized achievement tests are valuable if they are used properly.
Since the 1983 report A Nation at Risk there have been many reports that American education is failing.
A critical look at data demonstrates that schools are actually performing better than ever. The most famous indicator of school performance is the Scholastic Aptitude Test (SAT), which is reported annually and widely covered in the media. Since 1995 the SAT scores have actually been higher than before. The Sandia Report (1993) concluded that the average SAT scores were about 5% lower in 1992 than they were in 1960, with verbal scores and math scores dipping in the 1970s, but increasing significantly in 1985. Scores for mathematics remained flat, and the average verbal went up and down but returned to the same average as the 1980 score.
One of the most important considerations in interpreting SAT data is the fact that only students intending to go to college take the test. In the past, before college education became so critical to most occupations, relatively few students attempted the test and these also tended to be the best prepared to go to college. Between 1960 and 1990, reflecting changing patterns, many more students in the bottom ranks of high school classes (60%) took the test, along with many foreign and ESL students. As a result, according to the Sandia Report, half the decline in average SAT scores can be explained by the demographic characteristics of students who took the examination.
Overlooked are several facts: the gap between minority and white students narrowed; there was a significant increase in the number of top-scoring students (65%), even though non-traditional students had taken the examination; and the test is more difficult than it has been in the past, according to the Educational Testing Service.
An important question about the SAT, or any test for that matter, concerns its validity. There is ample evidence that the SAT is not valid, which is disappointing and detrimental. It is much easier for a newspaper reporter or politician to see a 2 or 3 point drop in average test scores and believe that it is meaningful than for them to truly understand validity and reliability as statistical concepts in test development. They easily grasp the former and do not bother with the latter. A basic concept in validity is that the subjects who make up the normative group are similar, and as indicated above this has not been the case. Also, it is presumed that the SAT has predictive validity, but many students who perform poorly on the test have succeeded in college and in life. Moreover, the test has come to be a measure of school district or high school quality, misinterpreted by many as an indication that some schools are great and others are poor, purely on the basis of a select group of students in particular districts who happen to take the test. This is borne out by the fact that the SAT differentiates students not only by income but also by their parents' role in the economic system. "The average scores of the children of professionals are higher than the children of white collar workers, which in turn, are higher than the children of blue collar workers. High school rank, which is a better measure of academic achievement than SAT scores, shows no such correlation" (The SAT Aptitude or Demographics?).
There is evidence too that the SAT is less a measure of overall ability in math and verbal skills than a test of some students' abilities to solve some problems more quickly than others and to do it in the face of a long testing session. Speed and endurance are critical factors in scoring well (Berliner & Biddle, 1995; Stedman, 1995; Stedman, 1994a; Stedman, 1994).
Should performance-based assessment replace norm-referenced tests?
Although many advocates of performance-based assessment and critics of standardized tests may want to abolish standardized achievement tests, it is likely that both forms of assessment will remain in schools, and many developers will try to merge both types of tests so that both performance and normative information are available. In time, performance-based assessment procedures will be valued by teachers less as a replacement for multiple-choice tests than as a way to improve daily instruction. For this reason, performance-based assessment will probably coexist with traditional achievement tests but not replace them. Only a small group of writers, such as those in the whole-language movement, entirely repudiate standardized tests.
With the spread of state legislation to require standardized test scores as a method of "grading" individual schools and districts, standardized testing is alive and well.
Is performance-based assessment likely to be accepted by teachers?
For the most part, traditional achievement tests are easily administered by teachers, who have no responsibility for the development and scoring of the tests. Performance-based assessment is much more demanding of the teacher, but this may be one reason why many teachers want to use the approach. It matches more closely what they do in classrooms and they have a hand in the process. Early childhood and early primary teachers already do many of the kinds of things that are used as performance-based assessment methods, so they take to it more naturally. Teachers in the fine arts and physical education make individual assessment of student performances.
Some teachers in the upper elementary and middle grades may be more reluctant to embrace the program without a lot of training and support, but this may first require a change in the curriculum and daily instructional routines. High school teachers may resist performance-based assessment because secondary teaching is often based on lecture. But in secondary schools and in some universities, teachers are beginning to embrace performance-based assessment.
Due to the many changes in education, teaching methodologies have altered significantly in recent years and teachers will find that performance-based measures can be helpful to them in meeting the goals of new methodologies. As more teachers steer away from large group instruction and a singular reliance on textbooks and worksheets, they are likely to become more reflective. More and more classrooms at all levels now incorporate small-group activities, cooperative learning, and computers, all of which require changes in the basic context of the classroom and curriculum materials. This leads to the need for different kinds of assessment, and teachers will see the value.
A good example is whole-language instruction (Goodman, 1992), which forces teachers to depart markedly from accepted modes of classroom organization. Whole language instruction has become popular as a performance-based alternative to basal reading programs. Integrated instruction, especially in language arts, means that schools no longer teach isolated reading and language arts during different periods but combine them, even in mathematics, social science, and science. Elements of reading, writing, and mathematics are not separated into independent, isolated learning objectives to be learned individually, but skills are taught as they become apparent for achievement.
Criticism of performance-based assessment
Naturally, some advocates of standardized testing are critical of performance-based assessment, if for no other reason than that standardized testing is threatened. But even advocates of performance-based assessment have concerns and reservations about such procedures because of the time, resources, and effort required to develop a new system of assessment. Some critics fear that, like educational fads in the past, the performance-based movement has little or no research base and that in a few years we may have committed to a movement that will be found lacking and ineffectual.
Although there are research reports to show the promise of performance-based assessment, it will take many more studies to satisfy critics.
Without norms (although some performance assessments include norms), there may be concern among parents, teachers, or the school board that children are not learning or will not match levels of achievement in other schools or districts. The explanation most often given in such circumstances is that performance measures are tied closely to what children are actually doing in classrooms and that a collection of performance information will clearly show the child's growth and achievement. Moreover, most schools attempt to develop a common set of expectations for children, as well as common performance assessments that provide for comparisons. While validity of performance assessment may be high, because of the nature of the process used in establishing assessments related to classroom activities, reliability of assessment is always a concern. As will be described below, the development of standards or "rubrics" is critical for reliability.
Most educators and psychologists have always been concerned about the statistical qualities of instruments, particularly validity and reliability. There is always the danger that, without objective criteria, some teachers may unfairly judge some children because of favorable or unfavorable biases. While this may be possible with the award of subjective grades, the record of a student in a portfolio can be examined by a third party who may examine the products without bias.
The Three "P's" of Performance-based assessment
In broad terms, there are three types of performance-based assessment: performances, portfolios, and projects. The boundaries among performances, portfolios, and projects can be rather loosely interpreted, but the differences are distinct enough to permit separate classification. Examples of school tasks that may be included in performance-based assessment are:
Art work
Cartoons
Collections
Designs and drawings
Documentary reports
Experiments
Foreign language activities
Games
Inventions
Internet transmissions
Journals
Letters
Maps
Model constructions
Musical compositions
Musical scores
Notebooks
Oral reports
Original plays, manuscripts, stories, dances
Pantomimes
Performance on a musical instrument
Poetry recitations
Photos
Plans for inventions
Problems solved
Puppet shows
Reading selection
Recipes
Scale models
Story illustrations
Story boards
Performances
Performance may be simple or complex, intermediate or summative. A child may demonstrate knowledge of addition (regrouping) by using concrete objects instead of a test item on paper. A child may classify objects, such as Venn diagrams, plants, or animals, or perform more complex tasks, such as setting up a database. While a student may perform a musical piece, the same strategy can be applied in many skills and subject areas. Rather than taking a printed test to check reading skills, the child may read (perform) a passage or tape record oral reading, for example. Recitations, oral reports, and group presentations lend themselves to performances in social studies, foreign languages, mathematics and science, and other parts of the curriculum. This can vary with the imagination of the teacher and the students. Any knowledge or skill that can be demonstrated has the potential for a performance: oral presentation, lab demonstration, reading, debate, dance, or dramatization.
Portfolios
Portfolio assessment is appropriate for ongoing evaluation of skills such as writing, reading, and the fine arts. The portfolio is a way to physically store work of students, which can include video or audiotapes of performances. Even the work that is not written can be videotaped and placed in the "portfolio," which may be nothing more than a box. The portfolio is a collection of work assembled by a student over a long period of time. Portfolios are used for science, mathematics, language arts, literacy, and the arts. Portfolios could include samples of students' paintings, drawings, stories, letters, poems, lists, signs, handwriting, and use of numbers. Portfolios, collected frequently over a period of time, can serve as a collection of work to show a student's achievements. They are based on multiple sources of information that can be collected over a year or several years in the school setting. Therefore, the work can give a developmental history of the pupil and lead to predictions about potential. The documentation amassed in a collection can contain both the best work and the ordinary work of the student.
Portfolios may consist of audiotapes, videotapes, written works, art work, sculpture, photographs, and many other forms. An active portfolio assessment will soon take up much space, so decisions must be made about how long items are kept. As technology is becoming more available and affordable, it is possible to store many items in a secondary format, such as videotape or digitized video. Sophisticated development of assessment items may also improve delivery and use of items for teachers. In Maryland, a statewide technology-based staff development project includes a multimedia database for providing teachers with research on teaching and learning. Jay McTighe of the Maryland Assessment Consortium has fostered the development of a database to include performance assessment tasks for teachers to use.
Projects
Projects may be used to assess the ability to reason, to gather and organize facts and ideas, and to produce an integrated work.
A project may be a problem-solving approach that entails the presentation of a problem that must be solved by applying theories and formulas. Problems may be in science, math, and social studies, for example. A project can consist of a short- or long-term activity of an individual student or a team that results in a product that represents some considerable effort. Depending upon the teacher's goals and the student's capabilities or interests, projects can be very simple or quite complex. Either as a separate subject area or an integrated product, a project can be prepared in many forms. Children can create original written works, such as prose or poetry, photographic essays, musical productions, dances, and technical-type papers, ranging from experimentation to research.
Some considerations about portfolios are:
How often should students be expected to archive materials or add to their portfolio? Weekly? Monthly?
Who should have access to portfolios? Students? Teachers? Administrators? Family? Visitors?
Who is responsible for storing and maintaining portfolios? Students? Teachers? Administrators? Family?
When will students work on portfolios? Special time? Any time? During certain lessons? Evening? At home?
How much time will students be allowed to work on portfolios?
In establishing evaluation criteria for performance-based data, decisions must be made about what kind of student work to include: best work only, longitudinal work, and so forth. In language arts, mathematics, or in integrated subjects, the purpose should be to show a record of a student's development. Therefore, not only the student's work should be included but also the comments and feedback provided by the teacher and copies of work of others who may have collaborated with the student.
To establish reliability, some programs also attempt to develop norms for performance assessments, but many schools do not. It is critical that the school express precisely the skills they want students to achieve and the activities that will be used to lead students to succeed. One method gaining wide acceptance is the development of "rubrics" (rating scales) for scoring products and performances associated with standards set for the class. In other words, the standards must be evident and clear to students, and models or examples should be available for students to know what is considered acceptable performance or mastery. The characteristics of acceptable performances, portfolios, and products should be developed and refined in order to specify standards. Standards should be clear enough to permit objective evaluation by other experienced teachers; there should be agreement or interrater reliability.
Rating scales for rubrics are based on the point system so common in ratings and questionnaires, the "Likert" type item. For example, work may be rated as poor, fair, good, and excellent, with corresponding values of 0, 1, 2, and 3, ranging from unacceptable to comprehensive evidence of expected qualities of performance.
Developing rubrics and anchors for the breadth and scope of the school curriculum is the most difficult and time-consuming activity of schools as they implement the performance-based assessment program. A considerable amount of time is also spent judging students' work according to rubrics. An example of a generic rubric for a language arts activity follows; it could be associated with a particular story or book:
3 (Exceeds standard)
Student expertly interprets characters and plots and shows an exceptional understanding of social interactions and themes. In following all directions of the assignment, the student provided precise information in an exceptionally clear format without any structural or mechanical errors. The student provided many specific examples to support his/her ideas.
2 (On standard)
Student shows a good understanding of major characters and events. All aspects of the assignment were addressed. Although there were some minor inaccuracies about information and the structural or mechanical clarity, appropriate examples were used to support ideas.
1 (Approaching standard)
The student revealed a beginning understanding of the characters, events, and plot. Some errors in factual accuracy and mechanics were evident. The student followed some of the directions and provided some specific examples.
0 (Below standard)
The student shows limited or no knowledge of character, events, and the plot. Few or no directions for the assignment were followed. There were many errors and no specific examples were provided.
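The four-level scale above lends itself to a simple record-keeping scheme. The following is a minimal sketch, assuming a teacher stores one 0 to 3 rating per portfolio entry; the function name and the sample ratings are hypothetical and serve only to illustrate the idea.

```python
# Rubric levels as described in the text: 0 (below standard) to 3 (exceeds standard).
RUBRIC_LEVELS = {
    0: "below standard",
    1: "approaching standard",
    2: "on standard",
    3: "exceeds standard",
}

def summarize(ratings):
    """Return the mean rating and the label of the most recent rating."""
    for r in ratings:
        if r not in RUBRIC_LEVELS:
            raise ValueError(f"rating {r} is not on the 0-3 scale")
    average = sum(ratings) / len(ratings)
    return average, RUBRIC_LEVELS[ratings[-1]]

# Hypothetical ratings for one student's portfolio entries, oldest first.
ratings = [1, 1, 2, 3]
average, latest = summarize(ratings)
print(f"mean rating {average:.2f}; latest work is '{latest}'")
```

Keeping the ratings in order, rather than reducing them to one average, preserves the record of growth that a portfolio is meant to show.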
Performance assessments require students to work according to specific tasks with well-defined conditions, and they are aware at the outset that their work will be evaluated according to the expressed standards. Because this approach requires students to actually demonstrate what they know, performance assessment is a better indication of knowledge and skill.
Performance assessment can be supplemented with observation and interview data, based on rating scales and checklists. Such evaluations may be used to rate student behaviors and interactions, but specific tasks or performances should be evaluated with more thoroughly developed criteria, as in rubrics. Documentation of observations can take the form of checklists, rating scales, anecdotal records, audio and videotapes, and photographs. With such information, rather than test scores in a cumulative record, the portfolio can reveal many things about the child who created the contents. Evaluation of the portfolio can lead to many outcomes that are typically impossible with traditional norm-referenced tests or multiple-choice tests.
While products or projects may be easy to determine and to establish associated criteria (rubrics and anchors), dealing with developmental skills in the curriculum, such as reading, writing, and math, may be more difficult. However, such skills are often considered within categories, such as emergent, beginner, competent, and fluent. For example, Helen Martinson, of Dutch Neck School, Princeton Junction, New Jersey, reports that the first-grade teachers use a form for a reading conference that shows both the passages read or attempted by a student and the assessments rated by the teacher. For an emergent reader, for example, the teacher shows the book "Spooky Old Tree" or xeroxed passages to the parent, then provides assessments of fluency, comprehension, and strategies used to teach the child, and comments about performance. As the child moves closer to a beginning reader and then to a competent reader, the teacher can show progress with new reading passages from different books and also the assessments of progress. This is much more meaningful information to the child, parents, and teacher than a single score from a reading test.
Some schools improve reliability by having teachers in other schools read a random sample of work and use the rubrics to see if they reach common levels of evaluation. They may also look at supporting information to determine if the teachers are providing enough feedback and direction to students or expanding on the curriculum sufficiently.
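One way to check whether two readers "reach common levels of evaluation" is to compare their rubric ratings on the same sample of work. The sketch below computes the simple rate of exact agreement and Cohen's kappa, a standard statistic that corrects agreement for chance. The ratings, variable names, and function names are hypothetical; the text does not say which statistic, if any, these schools use.

```python
from collections import Counter

def exact_agreement(a, b):
    """Proportion of items on which the two raters gave the same rating."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = exact_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same level independently.
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 0-3 rubric ratings of the same eight work samples.
classroom_teacher = [3, 2, 2, 1, 0, 2, 3, 1]
outside_reader = [3, 2, 1, 1, 0, 2, 2, 1]

print(f"exact agreement: {exact_agreement(classroom_teacher, outside_reader):.2f}")
print(f"Cohen's kappa:   {cohens_kappa(classroom_teacher, outside_reader):.2f}")
```

This sketch assumes both lists rate the same items in the same order and that the raters do not all use a single level, in which case the chance-agreement term would make kappa undefined.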