Assessment Practices in Informal Science

January 1st, 2016

This Knowledge Base article was written collaboratively with contributions from Yoon Jeon Kim and Mac Cannady. This article was migrated from a previous version of the Knowledge Base. The date stamp does not reflect the original publication date.


There is a great need to develop assessment practices within informal science learning environments that can keep up with the innovations in the field and support both practice and research. The need for assessment tools that measure a range of dispositions and capabilities is well documented:

One of the main challenges at present is the development of means for assessing participants’ learning across the range of experiences…. Rigorous, shared measures and methods for understanding and assessing learning need to be developed, especially if researchers are to attempt assessment of cumulative learning across different episodes and in different settings… At the same time, the focus of assessment must be not only on cognitive outcomes, but also on the range of intellectual, attitudinal, behavioral, sociocultural, and participatory dispositions and capabilities that informal environments can effectively promote (i.e., the strands). They must also be sensitive to participants’ motivation for engaging in informal learning experiences, and, when the experience is designed, assessments should be sensitive to the goals of designers.

These recommendations from Learning Science in Informal Environments: People, Places, and Pursuits (NRC, 2009) present the field with a call to action. While it is clear that there is a need for assessments that match the focus and spirit of informal science learning, we have much to learn about what works and does not work in a variety of informal STEM learning contexts.

With that said, a substantial body of research on assessment in other contexts can guide the development and use of assessments in informal science learning. For example, Samuel Messick (1994) asserts that “Such basic assessment issues as validity, reliability, comparability, and fairness need to be uniformly addressed for all assessments because they are not just measurement principles, they are social values that have meaning and force outside of measurement wherever evaluative judgments and decisions are made” (p. 13). With this in mind, we describe best practices in assessment design and use that have been identified in other fields.

Findings from Research and Evaluation 

Best Practices for Assessment

Best practices suggest that assessments in informal science should be:

Valid – The question of validity is whether the results of an assessment accurately measure the target construct. Without evidence supporting the inferences to be made from an assessment in a given context, it is unclear how the results can be useful. Baldwin, Fowles, and Livingston (2005) note that “an assessment is valid for its intended purpose if the inferences to be made from the assessment scores (e.g., a learner has demonstrated the ability to write analytically) are appropriate, meaningful, useful, and supported by evidence.” One example of providing validity evidence for a measurement tool is the technical brief for a measure of fascination in science (Chung, Cannady, Schunn, Dorph, & Bathgate, 2016).

Evidence-Centered Design (ECD; Mislevy, Steinberg, & Almond, 2003) is a framework that can be used to ensure validity during the design phase of assessment. The central principle of ECD is that educational assessment is an evidentiary argument. ECD guides the design and implementation of assessment as a principled process by formalizing the assessment structure to systematically align learners’ actions in science learning with the specific outcomes that stakeholders wish to evaluate.

Reliable – Reliability typically refers to the internal consistency with which the instrument measures the targeted construct. For surveys, tests, or other item-based instruments, reliability is typically calculated as one of various forms of intercorrelation coefficients (DeVellis, 2003). Alternatively, reliability can be viewed as the extent to which responses are free of measurement error (AERA, APA, & NCME, 2014). This view allows assessment designers to identify and address possible sources of error in the early design and development stages (Abell et al., 2009) for a wide variety of assessment tools (e.g., observation protocols, embedded assessments, rubrics).
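For item-based instruments, one common internal-consistency coefficient is Cronbach’s alpha. The sketch below shows the calculation on invented Likert-style responses (rows are respondents, columns are items); it is an illustration of the formula, not a tool endorsed by the sources above.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances / variance of totals)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()   # sum of per-item sample variances
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical responses: 5 respondents x 4 items on a 1-5 scale
responses = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
])
print(f"alpha = {cronbach_alpha(responses):.2f}")
```

Values closer to 1 indicate that the items covary strongly, i.e., the instrument is internally consistent.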

Fair – Fairness in educational assessment has four meanings: (a) lack of bias, (b) equitable treatment in the testing process, (c) equality in the use of outcomes from testing, and (d) equal opportunities for different subgroups to learn (AERA, APA, & NCME, 2014). Bias in assessments refers to systematic differences in responses that are not relevant to the targeted construct. This could be a limitation in reading ability affecting a math test, but it also includes performance prompts that elicit different behaviors from different subgroups (Linn, Baker, & Dunbar, 1991). For example, asking what color bananas are (green, yellow, or brown) has cultural implications: some cultures eat green bananas, others consider bananas yellow, and cooked bananas are brown, yet only yellow is scored as correct. Thus, a construct-irrelevant cultural influence affects responses across subgroups (Richards & Schmidt, 2013). Similarly, the assessment process should not vary in ways that introduce construct-irrelevant influences on responses. For example, if accommodations are available for some learners due to poor eyesight, the same accommodations should be made available to all assessment takers if eyesight is irrelevant to the construct of interest (Russell & Kavanaugh, 2011). Further, the outcomes of assessments should be used equally across learners: the consequences for some respondents should not differ from the consequences for others. Insofar as assessments are able to provide opportunities for learning, these opportunities should be available to all subgroups.

Ongoing – Learning, including informal science learning, is active (Driver, Asoko, Leach, Mortimer, & Scott, 1994), situated within authentic contexts (Greeno, 2006), and occurs in complex social environments (Bransford et al., 2006). Therefore, assessment evidence is more reflective of the learning process when gathered over time (i.e., continuously) and across contexts, activities, and social environments. Estimates of learners’ knowledge, skills, interests, and mindsets are continually updated based on multiple observations in diverse contexts rather than a single observation at one point in time (DiCerbo & Behrens, 2014).
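One simple way to formalize “continually updated based on multiple observations” is a Beta-Binomial update: each observed success or failure at a skill nudges the running estimate rather than replacing it. The observation sequence below is invented; this is a minimal sketch of the updating idea, not a model proposed by the sources above.

```python
# Beta(1, 1) is a uniform prior: no opinion about the learner before any observations
a, b = 1, 1

# Hypothetical observations across visits/activities: 1 = skill demonstrated, 0 = not
observations = [1, 0, 1, 1, 1]

for n, obs in enumerate(observations, start=1):
    a += obs        # count of observed successes
    b += 1 - obs    # count of observed failures
    estimate = a / (a + b)  # posterior mean of the learner's success probability
    print(f"after {n} observations: estimate = {estimate:.2f}")
```

Early observations move the estimate a lot and later ones less, which matches the intuition that confidence about a learner should grow with evidence gathered across contexts.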

Formative – Information derived from the assessment should be actionable for learners, educators, and facilitators, providing timely scaffolding that furthers learning. The essence of formative assessment is the design of an activity that (a) elicits evidence of the skills and knowledge of interest, (b) requires learners to adjust their actions, and (c) motivates learners to learn (Kim, 2014). Well-designed informal science learning activities already carry those features; their assessments should, too.

Performance-based – Learners should be able to demonstrate their ability by showing the processes of solving meaningful and interesting problems. In performance-based assessments, what is being assessed is often not apparent (Rupp, Gushta, Mislevy, & Shaffer, 2010). Therefore, the assessment designer begins by specifying the targets of the assessment (i.e., the claims the assessor wants to make about the learners), using a framework like ECD, and clarifying the intended goals and outcomes of the experience.

Authentic – Activities and tasks used for assessment, like those designed for learning, should reflect real-world problems, encourage learners to think and act like practitioners (Chinn & Malhotra, 2002; NRC, 2012; Singer, Hilton, & Schweingruber, 2006), and require the integration of multiple kinds of knowledge and skill as they are used in practice (Darling-Hammond, Ancess, & Falk, 1995; Gulikers, Bastiaens, & Kirschner, 2004).

Directions for Future Research 

  • Assessment in informal learning contexts provides unique challenges related to scale, time, and unit of analysis. For example, when a learner doesn’t finish an activity in a museum and comes back later, how can we reliably claim what the learner learned from that experience? Also, learners often work with others, ask for adults’ help, or help other people, which challenges the notion of the individual as the unit of analysis.
  • Using external measures, such as surveys and tests, can impact the learning experience, potentially making it less fun or authentic. Future research needs to focus on how we can make assessment “embedded” within learning activities and spaces without interrupting the flow of the experience (see for example: Berland et al. 2013).
  • Protection – Defending informal science learning against threats to its relevance.

The inclusion of interactive technology as a learning tool can make it possible to embed assessment into the learning scenarios themselves, allowing the learner to be assessed while inside the experience. As learners make decisions to move through an interactive learning experience, their journey can reveal, for example, increased competence or a quicker pace, assuming those are the desired measures. The experience design becomes integrally linked to the assessment.

References

Abell, N., Springer, D. W., & Kamata, A. (2009). Developing and validating rapid assessment instruments. New York, NY: Oxford University Press.

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Baldwin, D., Fowles, M., & Livingston, S. (2005). Guidelines for constructed-response and other performance assessments. Princeton, NJ: ETS.

Berland, M., Martin, T., Benton, T., Petrick Smith, C., & Davis, D. (2013). Using learning analytics to understand the learning pathways of novice programmers. Journal of the Learning Sciences, 22(4), 564-599.

Chung, J., Cannady, M. A., Schunn, C., Dorph, R., & Bathgate, M. (2016). Measures technical brief: Fascination in science. Retrieved from: 20160331.pdf

Darling-Hammond, L., Ancess, J., & Falk, B. (1995). Authentic assessment in action: Studies of schools and students at work. New York, NY: Teachers College Press.

DeVellis, R. F. (2003). Scale development: Theory and applications (2nd ed.). Thousand Oaks, CA: Sage Publications.

DiCerbo, K. E., & Behrens, J. T. (2014). The impact of the digital ocean on education [white paper]. London: Pearson.

Driver, R., Asoko, H., Leach, J., Mortimer, E., & Scott, P. (1994). Constructing scientific knowledge in the classroom. Educational Researcher, 23(7), 5-12.

Greeno, J. G. (2006). Learning in activity. In R. K. Sawyer (Ed.), The Cambridge handbook of the learning sciences. New York, NY: Cambridge University Press.

Gulikers, J. T., Bastiaens, T. J., & Kirschner, P. A. (2004). A five-dimensional framework for authentic assessment. Educational Technology Research and Development, 52(3), 67-86.

Kim, Y. J. (2014). Search for the optimal balance among learning, psychometric qualities, and enjoyment in game-based assessment (Unpublished doctoral dissertation). Florida State University, Tallahassee, FL.

Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15-21.

Messick, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), 13–23.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). Focus Article: On the Structure of Educational Assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62.

Mislevy, R. J., Behrens, J. T., Dicerbo, K. E., & Levy, R. (2012). Design and discovery in educational assessment: Evidence-centered design, psychometrics, and educational data mining. Journal of Educational Data Mining, 4(1), 11–48.

National Research Council. (2009). Learning science in informal environments: People, places, and pursuits. Washington, DC: The National Academies Press.

Richards, J. C., & Schmidt, R. W. (2013). Longman dictionary of language teaching and applied linguistics. Routledge.

Russell, M. K., & Kavanaugh, M. (2011). Assessing students in the margin: Challenges, strategies, and techniques. IAP.

Stiggins, R. J. (2002). Assessment Crisis: The absence of assessment FOR learning. Phi Delta Kappan, 83(10), 758–765.