Language Models in the Classroom: Bridging the Gap Between Technology and Teaching

More than two years after the release of ChatGPT, the anticipated “EdTech Revolution” has yet to happen. Despite impressive technical advances in large language models, substantial resource investments, and the widespread implementation of these models in edtech products, their use in K-12 classrooms remains surprisingly low and stagnant. According to a nationally representative survey of educators from October 2024, only 2% of teachers use generative AI-based tools in their classrooms “a lot,” 68% have never used them, and 36% don’t even plan to start. What’s even more surprising is that these numbers have hardly changed from 2023 to 2024.
What gets lost in the pursuit of scalable technical solutions for a deeply social, bespoke (and political) problem space? What does it take for language technologies to truly empower teachers and students?
These questions shaped our work in our Stanford class Empowering Educators Via Language Technology, or CS293/EDUC473. Over the course of the quarter, we heard from industry leaders at Amplify, Schoolhouse World, TeachFX, and TeachingLab, as well as researchers from Google DeepMind and leading research universities. We read papers from top natural language processing (NLP) and AI conferences, and we worked hands-on with educational data spanning various domains, including textbook content, classroom discourse and student essays. We explored how language models measure instruction quality, generate feedback, evaluate essays, simulate students and teachers, and support chat-based tutoring. We annotated data and applied a range of methods, from lexical analyses to prompting, fine-tuning, and reinforcement learning techniques.
We share our reflections as core questions for research and practice at each stage in the machine learning development pipeline.
Reflection 1: Problem Definition
Who are we designing for?
Less than five percent of edtech users engage with tools at the “recommended dosage” needed to be counted in efficacy studies—a phenomenon known as The Five Percent Problem. That means 95% of people aren’t using these tools, often because the tools don’t address their needs and/or are too difficult to use. We must recognize that, as technologists, we are likely part of the small group for which edtech is considered “effective.” What blind spots and assumptions, then, are baked into our designs? If we do not engage thoughtfully and design humbly for the 95%—to build useful and usable technologies for all learners and teachers—we risk developing systems that reinforce existing inequities rather than address them.
What is the pedagogical objective?
Technologies in education are often framed as efficiency tools—automating grading, feedback, lesson planning, assessment, and even tutoring—but these systems do not “understand” student learning and risk producing output that is pedagogically unsound. Not all automation is helpful. Writing feedback, structuring lessons, and assessing student understanding are not just logistical burdens but personal and highly contextual pedagogical acts. Rather than maximizing automation, development should be grounded in well-defined objectives of instructional quality.
Are the right people involved?
Every decision in technology development reflects underlying values about what education should be. Engaging key stakeholders—including teachers, students, administrators, and parents—is essential to ensuring language technologies align with real educational needs and values. Co-design and iterative feedback from human experts and stakeholders should happen throughout every stage of development, rather than waiting until the products are deployed. Research methodologies from the Human-Computer Interaction (HCI) field offer actionable models for this kind of collaboration. For example, in one longitudinal study, a researcher co-taught a class with a performing arts educator for over two years, iteratively co-designing and observing the use of classroom tools in real time. This deep, sustained engagement allowed the researcher to trace the entire lifecycle of the tools, from design to daily implementation.
Reflection 2: Data Specification
Does the data fairly represent our world?
Much of the data used to train AI in education comes from convenience samples—reusing datasets collected over a decade ago (like the National Center for Teacher Effectiveness (NCTE) dataset of math classroom transcripts or Automated Student Assessment Prize (ASAP) essays) or collecting samples with selection bias through willing data partners. These datasets are tied to specific grade levels, subjects, and environments, making them difficult to generalize beyond their original context. Using such data risks out-of-distribution harms. Would a model recognize student reasoning when presented with code-switching or non-dominant dialects? Would an auto-grader unfairly penalize those whose writing styles do not align with academic language norms? If our datasets don’t represent our target population of teachers and students, we risk creating tools that work well for some and fail others. We must also continue collecting fair and diverse datasets that capture the full spectrum of learning contexts and experiences.
Does the data capture enough complexity?
Moreover, we studied tools built on text-based data, but teaching and learning are multimodal. A tutoring chatbot can be built on transcripts alone, but real tutoring interactions are shaped by tone, facial expressions, gestures, gaze, and what students write, draw, or point to. As researchers, we have to ask ourselves honestly what it is we are trying to approximate and when oversimplification risks harm. For example, younger and/or multilingual students may especially rely on drawing to explain their thinking, so determining whether they are “reasoning” based on verbal expressions alone may result in problematic false negatives.
What are the standards of data quality?
Educational data can be inherently noisy. The recorded data may be riddled with mistranscriptions and typos, and key moments of teaching and learning that we aim to model may occur rarely. Data annotation often involves high-inference constructs on which two humans may not agree when executing the same task (e.g., is this feedback too vague? Too harsh?). Moreover, the contextual nuances of classroom interactions, such as tone and intent, are often lost in raw data. Without fixed rules to handle these ambiguities, those of us who do the modeling must get hands-on with data cleaning and labeling to truly understand and reveal the complexities and data limitations underlying model performance.
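When annotation involves high-inference constructs, it helps to quantify annotator agreement before training anything on the labels. Below is a minimal sketch in Python, assuming scikit-learn is available; the feedback-vagueness labels are hypothetical placeholders rather than data from our class.

# Minimal sketch: measuring inter-annotator agreement on a high-inference
# construct (e.g., "is this feedback too vague?"). Labels are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Two annotators rate the same ten pieces of teacher feedback (1 = too vague, 0 = not).
annotator_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
# Low agreement signals that the construct or the codebook needs refinement
# before any model is trained on these labels.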
Reflection 3: Modeling
Is an LLM really necessary?
Despite the popularity of large language models like GPT-4, their high computational and environmental costs impact the pricing and accessibility of tools. When considering sensitive student or teacher data, these large models can pose further privacy risks as they are often accessed through third-party APIs. In contrast, smaller models can often be fine-tuned for specific tasks while maintaining comparable performance at a lower cost. For example, RoBERTa is effective for text classification, while LLaMA or Mistral (7B) can handle closed-domain question answering, topic clustering, and summarization. Classical NLP algorithms using the frequency of n-grams to classify, predict, or cluster text remain valuable as well, offering faster inference and more interpretability. These LLM alternatives can offer greater control over data privacy through local hosting; however, more engineering effort may be required to deploy these techniques at scale.
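To make the classical-NLP point concrete, here is a minimal sketch of an n-gram baseline built with scikit-learn: a TF-IDF bag-of-n-grams feeding a logistic regression classifier. The utterances and labels are hypothetical placeholders standing in for an annotated classroom dataset.

# Minimal sketch of an LLM alternative: a TF-IDF n-gram classifier.
# The utterances and labels below are hypothetical placeholders.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    "Can you explain how you got that answer?",
    "The answer is 42, let's move on.",
    "What do you notice about the two fractions?",
    "Copy the definition from the board.",
]
train_labels = [1, 0, 1, 0]  # 1 = elicits student reasoning, 0 = does not

model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),     # unigrams and bigrams
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(train_texts, train_labels)
print(model.predict(["Why do you think that works?"]))  # e.g., [1]

A baseline like this trains in seconds on a laptop, runs locally, and exposes its learned n-gram weights for inspection, which is part of the interpretability advantage noted above.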
What do technical paradigms afford and constrain?
Modeling in education requires transforming something fluid and context-dependent—like student thinking or teaching practices—into structured data. In doing so, information is inevitably lost—every supervised label and every text summary smooths over details that matter in real classrooms. General-purpose LLMs struggle with these nuances, as most pedagogy is too complex to be captured through prompting, even with techniques such as few-shot or chain-of-thought. Retrieval-augmented generation (RAG) is promising in educational contexts, allowing models to pull in a wealth of texts and resources rather than relying on pre-trained knowledge alone. Techniques like reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) can help align models with classroom needs, but they are prone to reward hacking when “teacher preferences” are poorly understood. What are the notions of quality that teachers prefer? Is a preferred output necessarily pedagogically sound? (Note: DPO is particularly cost-effective, requiring little training data. As a class assignment, we trained a somewhat passable tutor chatbot with just 20 labeled examples.)
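For readers unfamiliar with DPO, the sketch below shows the objective it optimizes: the policy is pushed to prefer the teacher-chosen response over the rejected one by a wider margin than a frozen reference model does. This is a from-scratch illustration of the published DPO loss in PyTorch, not the code we used for the class assignment; the log-probability tensors are placeholders.

# Minimal sketch of the DPO objective. In practice, each log-probability is the
# sum of token log-probs of a full response under the policy or reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # -log sigmoid(beta * (policy margin minus reference margin)), averaged over the batch
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of two preference pairs (teacher-preferred vs. dispreferred tutor replies)
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -15.0]),
    policy_rejected_logps=torch.tensor([-11.0, -16.0]),
    ref_chosen_logps=torch.tensor([-12.5, -15.5]),
    ref_rejected_logps=torch.tensor([-11.5, -15.0]),
)
print(loss.item())

The reward-hacking concern above lives in the data: whichever responses get labeled “chosen” and “rejected” define the preferences the model will amplify.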
Who performs the modeling?
Teachers should be involved at every stage of modeling—from designing annotation schemas to evaluating intermediate model outputs. Modeling strategies should also allow for teacher customization, enabling them to specify evaluation criteria, adjust evaluation mechanisms, or refine model outputs to better fit their classroom needs. With the rise of LLMs and AI-powered code editors like Cursor or Replit Agent, teachers now have the opportunity to build and customize their own tools. As AI becomes more accessible, educators can take an active role in shaping how these models function. Building communities and shared spaces can facilitate such educator-led development: for example, the Stanford EduNLP lab is hosting a summit this summer to bring together math teachers to collaboratively shape research on and the development of language technologies in math education.
Reflection 4: Evaluation
Who is qualified to evaluate what?
We must critically consider who is qualified to evaluate model outputs. From the examples we observed in our course, we found the term “expert” loosely defined. Evaluating the instructional quality of generated teacher actions or the authenticity of simulated student actions requires deep pedagogical expertise and domain understanding. Human evaluators may lack the necessary knowledge, highlighting the need to define the qualifications and credentials that represent “expertise” (e.g., experience in teaching, pedagogical training, subject-matter expertise, socio-technical understanding of AI). Even more contentious is the common use of other LLMs to evaluate model outputs, which can reinforce biases, lack interpretability, and create epistemically circular reasoning.
Which metrics matter?
Evaluation involves a range of uncoordinated metrics—companies highlight usage rates and survey responses, model developers report accuracy and latency, and human evaluations of model outputs often rely on high-level preferences assessed in lab settings. Because speed of iteration is prioritized, in-context testing with real-world classrooms over time remains rare. These disconnects mean that many tools are optimized for metrics that do not truly capture meaningful learning outcomes. Meaningful evaluation of efficacy requires authentic educational contexts and predefined metrics that matter most to students and teachers.
Reflection 5: Deployment
What are the hidden costs?
Economic viability is a key consideration for edtech adoption. AI-powered products come at a premium, as the most powerful underlying technologies, like large language models, are increasingly gated behind paywalls. For schools, cost calculations go beyond licensing fees. Hidden expenses include professional development time spent on training, technical support, infrastructure requirements, and the opportunity cost of adopting new tools when existing ones might suffice. There is also the risk of relying on private companies for essential educational services—recent failures like the abrupt closure of FEV Tutoring highlight the instability of outsourcing public goods to private actors. In most cases, the biggest costs of ineffective or failed technology adoption are borne not by those who purchase a tool (district or school administrators), but by the end users: teachers and students. Addressing the misalignment in incentives (tech providers maximizing profit rather than impact) could help reduce the hidden costs associated with tech adoption.
Who bears the risks?
Beyond equity of access, there are ethical questions about testing AI on real students. While in-context evaluation is important, such testing needs to be done carefully and in multiple phases, much like clinical trials. Education research is uniquely high-stakes—its “measurable outcomes” directly impact students’ future opportunities, career prospects, and long-term trajectories. Yet, time and again, vulnerable communities become the testing ground for new technology, bearing the risks of unproven interventions while wealthier students continue to learn from highly skilled educators. Deploying AI in education demands careful scrutiny—not just of its potential benefits, but of who is most impacted when it fails. One principle that our course adopted is to ensure that AI’s interaction with students is mediated through teachers, both to reduce risk and to center the human connection.
Are we transparent about the limitations?
Bias, hallucinations, and sycophantic tendencies (the model saying what the user wants to hear) are well-documented issues in large language models. These limitations are unlikely to be solved by developers who rely on off-the-shelf models, which offer little control over how the systems were trained. Mitigating their risks and shortcomings must come through transparency. Rather than overstating AI’s capabilities, developers must clearly communicate what these models cannot do and set realistic expectations. This includes providing detailed documentation on known failure modes, what data the model was trained/tuned on, how it was evaluated, and relevant disclaimers to mitigate the risk of harm.
Authors: Graduate School of Education PhD student Mei Tan, Assistant Professor in Education Data Science at the Graduate School of Education Dora Demszky (instructors), and students in CS293 (ordered alphabetically by last name): Javokhir Arifov, Philip Baillargeon, Nathanael Cadicamo, Joshua Delgadillo, Eban Ebssa, Elizabeth Gallagher, Rebecca Hao, Matías Hoyl, TJ Jefferson, Ashna Khetan, Aakriti Lakshmanan, Lucía Langlois, Daniel Lee, Samantha Liu*, Yasmine Mabene*, Chijioke Mgbahurike, Shubhra Mishra, Cameron Mohne, Alex Nam, Kaiyu Ren, Poonam Sahoo*, Yijia Shao, Mayank Sharma*, Ziqi Shu, Alexa Sparks, Nicholas Tuan-Duc Vo*, Gordon Yeung.
How was this blog post written? Each student wrote a few paragraphs of reflection based on class discussions, readings, assignments and lectures. Students Rebecca Hao, Samantha Liu, and Yasmine Mabene synthesized the main ideas within each theme and facilitated an in-class small group activity where all students and instructors workshopped considerations, principles, recommendations and questions. Based on these inputs, Mei drafted the blog post by stitching together students’ reflections. (Interestingly, ChatGPT was not at all helpful in this process, as it could not synthesize while keeping students’ original writing intact). Instructor Dora Demszky and students Mayank Sharma, Poonam Sahoo, and Nicholas Tuan-Duc Vo then edited the blog post, shaping the final version through iterative revisions.
* Indicates editing contribution