Statement of Research Interests


From a substantive standpoint, my primary research interests relate to the topics of educational philosophy, intellectual assessment, and school/teacher effectiveness. From a methodological standpoint, my primary areas of interest relate to psychometric theory, research methodology, and statistical analysis. A further elaboration of my work within each of these areas is described below.


1. Educational Philosophy

The current educational and political context in the United States places a heavy emphasis on assessment and accountability. Yet, for assessments to be valid, it is critically important that they be aligned with program objectives (Tyler, 1990). Although many researchers have criticized the high-stakes testing movement on the basis of psychometric weaknesses of the assessments used, few empirical studies exist that have systematically examined the extent to which the accountability mechanisms used in education are appropriately aligned with the formal purposes of schooling. In an effort to fill this void, I have conducted a series of empirical research studies that have examined the formal purposes of schooling by looking at various sources of evidence, including school mission statements, legal court documents, state constitutions, and opinions gathered from student interviews (Stemler & Bebell, 1999; 2004). The results of these studies have shown that various sources of evidence converge upon five major purposes of formal schooling. These purposes are: 1) cognitive development; 2) social development; 3) emotional development; 4) civic development; and 5) vocational preparation.

Perhaps the most important message of this line of research is that although the development of the cognitive domain is an important purpose of schooling, it is only one of many purposes. Furthermore, there is no inherent precedence given to the development of cognitive skills within any of the sources of evidence we have analyzed. Rather, all purposes are emphasized equally. In spite of this fact, however, the assessment of the cognitive domain traditionally has received the most weight and tends to be the area for which students, teachers, and schools are held accountable. I believe that rather than eliminating the accountability system, a better solution would be to broaden the range of skills assessed so that they more adequately reflect the formal purposes of schooling articulated by various legal, political, and individual sources. Thus, my second major area of interest relates to the development of broader measures of intelligence and achievement that move beyond the skills assessed by traditional measures.



Stemler, S. E., & Bebell, D. (1999, April). An empirical approach to understanding and analyzing the mission statements of selected educational institutions. Paper presented at the New England Educational Research Organization (NEERO), Portsmouth, NH.

Stemler, S. E., & Bebell, D. (2004). The fit between the purpose of schools and student assessment in Massachusetts. Manuscript submitted for publication.

Tyler, R. W. (1990). Basic principles of curriculum and instruction. Chicago: University of Chicago Press.


2. Intellectual Assessment

The findings from my research on educational philosophy suggest that an educated individual possesses a balance of cognitive, social, emotional, civic, and vocational skills. Consequently, an important question to ask is how we go about assessing these skills. It is perhaps the case that the assessment of cognitive skills is held up as the gold standard because of the long tradition of research supporting the reliable and valid assessment of cognitive skills. Yet, in this era of advanced technology, we are now more than ever in a position to create reliable and valid measures of broader constructs. Toward this end, I have been a collaborator on the Rainbow project at Yale, directed by Dr. Robert J. Sternberg. The goal of the project is to develop supplemental assessments of creative and practical skills that can be used to augment the SAT. The major finding from our preliminary effort was that the use of supplemental measures nearly doubled the predictive validity of the SAT while at the same time reducing the achievement gap often observed between White students and traditionally underrepresented minority students (Sternberg et al., 2004).

The Rainbow project was interesting not only for the promising substantive findings, but also because of the advanced methodologies used to gather data. Specifically, the project involved an incomplete overlapping design requiring the use of Full Information Maximum Likelihood (FIML) to estimate population parameters. In the summer of 2003, I was selected to participate in an APA summer institute in Longitudinal Modeling at the University of Virginia. Participation in this summer institute allowed me to better understand the use of Structural Equation Modeling to model growth, as well as to gain a deeper understanding of the power of incomplete overlapping designs and the use of FIML for their analysis.

Although as a field, we know the most about the assessment of the cognitive domain, there is still much more work to be done in this area. Indeed, much of the work in the assessment of the cognitive domain has emphasized the assessment of memory or analytical skills. Yet, Sternberg (1997; 1999) has suggested that cognitive skills may be more accurately described as consisting of memory, analytical, creative, and practical skills. Thus, to truly assess higher order thinking skills, tests for college admission or advanced placement should include items that tap a broad range of cognitive skills.

In a recent study, funded by the College Board, we put our ideas to the test on a large scale. The goal of this project was to develop augmented measures of AP Psychology and AP Statistics that contained items tapping not only analytical and memory skills, but also practical and creative skills and to correlate these results with the results from the actual AP examinations. The results of the study showed that students exhibited (sometimes extreme) differences in patterns of strengths and weaknesses across the various cognitive skill areas. Furthermore, similar to the findings from the Rainbow project, we found that when students were assessed using items that tap a broader range of cognitive skills, ethnic differences in achievement were drastically reduced (Stemler, Grigorenko, Jarvin, & Sternberg, 2004). My colleagues and I recently received funding from the National Science Foundation that will allow us to extend this work to the domain of AP Physics.



Stemler, S. E., Grigorenko, E. L., Jarvin, L., & Sternberg, R. J. (2004). The theory of successful intelligence as a basis for enhancing Advanced Placement instruction and assessment. Manuscript submitted for publication.

Sternberg, R. J. (1997). Successful intelligence: How practical and creative intelligence determine success in life. New York: Plume.

Sternberg, R. J. (1999). The theory of successful intelligence. Review of General Psychology, 3, 292-316.

Sternberg, R. J., & The Rainbow Project Collaborators. (2004). The Rainbow Project: Enhancing the SAT through assessments of analytical, practical, and creative skills. Manuscript submitted for publication.


3. School and Teacher Effectiveness

My third major area of research interest relates to school and teacher effectiveness. My interest in the field of school effectiveness grew out of my dissertation research conducted at Boston College while working in the TIMSS International Study Center. My dissertation research involved a secondary analysis of the Third International Mathematics and Science Study (TIMSS 1995). Using Hierarchical Linear Modeling, I examined the explanatory power of a particular model of school effectiveness for elementary school mathematics and science across 14 different countries. Each participating country had at least 150 schools. The major finding of the project was that, within each country tested, the most effective schools in mathematics were those attended by students with a high internal locus of control (Rotter, 1966). This finding held across 12 out of 14 countries (Stemler, 2001), even after controlling for SES differences among students within each country.

After finishing my doctoral degree, I continued this line of research through a postdoctoral fellowship at the Yale University PACE Center, where I currently direct a U.S. DOE-funded project related to school effectiveness. One of the goals of the project is to test, through the use of Structural Equation Modeling, the relative explanatory power of dominant empirical models of school effectiveness against two relatively recent theoretical models (Sternberg, 2000; Leithwood, 2000). We are currently in the fourth year of this five-year project, and data analyses are ongoing.

Although my interest has traditionally been in the area of school effectiveness research, in the past few years, several research studies in this field have begun to show that the majority of variance in student achievement may be better explained by teacher level variables than by school level variables (Ayres, Sawyer, & Dinham, 2004; Hill & Rowe, 1996). Consequently, many researchers in the field have begun to shift their focus away from an examination of school level variables and toward a deeper examination of teacher level variables that may help to explain student achievement. During the past two years I have also begun to shift my focus toward an emphasis on better understanding the variables associated with teacher effectiveness. In particular, I am currently in the process of developing and testing a theory related to teachers’ practical skills in dealing with others. Just as there is more to schooling than the development of cognitive skills, some colleagues and I just submitted a paper arguing that there is more to teaching than instruction (Stemler, Elliott, Grigorenko, & Sternberg, 2004). Our developing theory is placed within the broader framework of Sternberg’s theory of successful intelligence. In order to test our ideas, we have created a series of tacit knowledge inventories for teachers. The inventories present a stem (problem situation) followed by response options representing the following seven behavioral strategies typically used by teachers in solving situations that arise on the job: (1) comply (do what is asked of you by whomever asks), (2) consult (ask someone else for advice), (3) confer (articulate the problem to the source), (4) avoid (ignore the problem or avoid the situation), (5) delegate (pass the responsibility for dealing with the problem onto someone else), (6) legislate (institute new policies for dealing with situations like the novel one being encountered), or (7) retaliate (act toward an aggressor the way they act toward you).

We are currently in the process of empirically testing our developing theory through the use of discriminant analysis. Using data gathered from a national sample of K-12 teachers, we are examining whether the use of certain strategies differentiates more v. less effective teachers in different contexts (dealing with supervisors, peers, and subordinates). We are also examining the extent to which systematic differences in strategy selection may be observed between teachers on the basis of the following factors: gender, ethnicity, urbanicity of the school, years of experience teaching, and number of schools in which the teacher has taught. The project has involved the participation of visiting scholars from seven different countries (U.S., Mexico, Spain, Germany, Russia, Canada, and England). In addition, we have recently begun a cross-cultural extension of the project in England in order to compare the strategies used by beginning and advanced teachers in England and the U.S.



Ayres, P., Sawyer, W., & Dinham, S. (2004). Effective teaching in the context of a Grade 12 high-stakes external examination in New South Wales, Australia. British Educational Research Journal, 30(1), 141-165.

Hill, P. W., & Rowe, K. J. (1996). Multilevel modeling in school effectiveness research. School Effectiveness and School Improvement, 7, 1-34.

Rotter, J. B. (1966). Generalized expectancies for internal versus external control of reinforcement. Psychological Monographs, 80 (1, Whole No. 609).

Stemler, S. E. (2001). Examining school effectiveness at the fourth grade: A hierarchical analysis of the Third International Mathematics and Science Study (TIMSS). Dissertation Abstracts International, 62(03A), 919.

Stemler, S. E., Elliott, J., Grigorenko, E. L., & Sternberg, R. J. (2004). There's more to teaching than instruction: Seven strategies for dealing with the social side of teaching. Manuscript submitted for publication.

Sternberg, R. J. (2000). Making school reform work: A "Mineralogical" theory of school modifiability. Phi Delta Kappan Fastback, 457.

Leithwood, K. (Ed.). (2000). Understanding schools as intelligent systems. Stamford, CT: Jai Press.


4. Psychometric Theory

From a methodological standpoint, my primary area of research interest has to do with psychometric theory. I am particularly interested in Rasch measurement theory because of the versatility of the Rasch model. Although from a mathematical standpoint, the Rasch model appears to be a special case (i.e., one parameter) of an item response model, there are important differences between the assumptions of IRT and those of Rasch measurement. Specifically, the Rasch model is a measurement model, and IRT models are statistical models. I am currently in the process of preparing a paper that elaborates upon the fundamental differences between Rasch measurement and IRT.
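As a point of reference, the mathematical relationship noted above can be written in standard textbook notation. The symbols below are assumed here for illustration and are not taken from the paper in preparation: \theta_n is the ability of person n, b_i is the difficulty of item i, and a_i is the discrimination of item i.

```latex
% Rasch model: probability that person n answers item i correctly
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}

% Two-parameter logistic (2PL) IRT model, which adds an item discrimination a_i
P(X_{ni} = 1 \mid \theta_n, a_i, b_i) = \frac{\exp\bigl(a_i(\theta_n - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta_n - b_i)\bigr)}
```

Setting a_i = 1 for every item in the second expression recovers the first, which is the sense in which the Rasch model appears, mathematically, to be a special case of an item response model.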

In attempting to assess constructs such as social and emotional intelligence, creativity, and practical skills, it is often useful to rely upon open-response items. The major limitation of open-response items is that they must be scored by raters and then incorporated into a single estimate of student ability. One important issue in this area has to do with how interrater reliability is gauged. I have recently published an article arguing that although interrater reliability is typically described in statistical and psychometric texts as a unitary construct, there are important differences in the approach to estimating interrater reliability that one chooses. I have argued that various methods for computing interrater reliability may be classified into three categories: consensus, consistency, and measurement approaches. Each of these three approaches carries with it different purposes, assumptions, and implications for how the data are best summarized. This seemingly innocuous and pedantic topic is actually critical to the legal defensibility of high-stakes tests. For example, when a consistency estimate is reported, but the data are summarized in a way associated with a consensus estimate, this can threaten the validity of the inferences made from the test. When appropriately used, the many-facets Rasch model provides a powerful solution to many challenges faced in estimating interrater reliability and combining scores from open-response and multiple-choice items. Currently, however, this technique is not widely used. Thus, one of my major areas of research interest relates to the dissemination of the use of this powerful approach.
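The distinction between consensus and consistency estimates can be illustrated with a minimal sketch. The ratings below are hypothetical, and the three statistics shown (percent agreement and Cohen's kappa as consensus estimates, Pearson's r as a consistency estimate) are common textbook choices rather than the full set of methods discussed in the article. The second rater is assumed to score every response exactly one point higher than the first.

```python
from collections import Counter
from math import sqrt


def percent_agreement(a, b):
    """Consensus estimate: share of responses given identical scores."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Consensus estimate: exact agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    categories = set(a) | set(b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in categories) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)


def pearson_r(a, b):
    """Consistency estimate: do the raters rank-order responses alike?"""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical scores on a 1-5 rubric; rater B is uniformly one point harsher
# or more lenient than rater A (an assumption made purely for illustration).
rater_a = [1, 2, 3, 4, 2, 3, 1, 4]
rater_b = [score + 1 for score in rater_a]

print("percent agreement:", percent_agreement(rater_a, rater_b))
print("Cohen's kappa:    ", round(cohens_kappa(rater_a, rater_b), 3))
print("Pearson r:        ", round(pearson_r(rater_a, rater_b), 3))
```

In this constructed case the raters never assign the same score, so the consensus estimates are at or below chance, yet the consistency estimate is perfect because the two raters order the responses identically. Averaging or substituting such raters' scores as if they were interchangeable is therefore only defensible under the interpretation that matches the estimate reported.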



Stemler, S. E. (in preparation). Understanding the difference between Rasch measurement and item response theory: Knowing when to cross the line.

Stemler, S. E. (in press). Automated essay scoring: A human's review [Review of Automated essay scoring]. Contemporary Psychology.

Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4), Available online:


5. Research Methods and Statistical Analysis

My fifth area of research interest relates to the use of research methods and various statistical techniques. My fundamental belief is that the research methods used must accurately reflect the nature of the questions under investigation. Furthermore, it is often the case that quantitative and qualitative methods can be used quite effectively in conjunction. For example, in my own research, I have often found content analysis to be a useful tool for uncovering big ideas in narrative texts (see Stemler, 2001). After a content analysis has been conducted, it is then possible to use more quantitative techniques to make comparisons along the relevant dimensions (e.g., comparing the extent to which teachers and principals mention physical resources as an important factor related to school effectiveness).

As a result of my experiences in teaching courses overseas with Framingham State College’s International Educational Program, I have begun work on an introductory Research Methods book with my colleague, Dr. Bill Murphy (Stemler & Murphy, in preparation). The book is practically oriented and is designed to introduce students to research design through the analysis of newspaper articles and news reports to which students are exposed every day. After using these media to master basic skills such as identifying variables, research questions, and sampling designs, students then begin to read research journal articles and master more advanced skills such as choosing the appropriate methodology and analytic technique for a particular research question.

Within the area of statistical analysis, my chief concern has been with understanding how various analytical techniques relate to one another and knowing which techniques are most useful in certain contexts. As a result of this interest, I am in the process of writing a book on choosing the appropriate statistical techniques for data analysis (Stemler, in preparation). Traditional texts tend to be focused either on univariate statistics or on multivariate statistics. I have found that dividing texts this way often leaves students without an overarching conceptual framework of how the various statistical techniques work together. By contrast, the framework I use is pragmatic and oriented not only toward the number of variables involved in analysis, but also their fundamental nature (i.e., categorical v. continuous). The goal of my book is to create a short and quick guide that will allow people to understand the connection among the various techniques and to understand the major assumptions of each technique, how to test the assumptions, what to do when the assumptions are violated, and to provide some examples of the kinds of questions best answered by each technique.



Stemler, S. E. (2001). An overview of content analysis. Practical Assessment, Research and Evaluation, 7(17), Available online:

Stemler, S. E. (in preparation). Choosing the appropriate statistical analysis method for your data.

Stemler, S. E., & Murphy, B. (in preparation). A practical guide to evaluating research.