- background
- the task
- data set, training costs: ~1,200 examples cost $1.41
- original GPT-4 prompt
- manually reviewed the collected data
- two orders of magnitude cheaper due to needing fewer hints and prompt tokens
- cost reduction
- past attempts
- the training data
- Possible issues:
    - Since many of my cards (~1,200) exist in the initial training set, there may be "confirmation bias", and the model might see a reduction in quality as new cards are added.
    - The fine-tuned model is more permissive about minor grammatical issues, whereas GPT-4 often rejects grammar mistakes. This might be fixable with more fine-tuning but would require native-speaker labeling.
    - Still better than self-grading via Anki, since GPT-3.5 has a better understanding of Korean than I do.
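The cost claims above can be sketched as back-of-the-envelope arithmetic. This is a hedged illustration, not the actual billing: the per-1K-token prices and the token counts per call are assumptions chosen to show why dropping the long few-shot GPT-4 prompt in favor of a fine-tuned GPT-3.5 model plausibly yields roughly a 100x saving; only the $1.41 total comes from the notes.

```python
# Illustrative cost arithmetic. All prices (USD per 1K tokens) and token
# counts are assumptions for the sketch, not figures from the notes.

def per_call_cost(prompt_tokens: int, completion_tokens: int,
                  in_price: float, out_price: float) -> float:
    """Cost of one grading call, with prices given in USD per 1K tokens."""
    return prompt_tokens / 1000 * in_price + completion_tokens / 1000 * out_price

# Grading with GPT-4 needs a long prompt (instructions, hints, few-shot
# examples); the fine-tuned model needs little more than the card itself.
gpt4_cost = per_call_cost(1500, 50, in_price=0.03, out_price=0.06)
ft_cost = per_call_cost(80, 20, in_price=0.003, out_price=0.006)

print(f"GPT-4 per grade:      ${gpt4_cost:.5f}")
print(f"fine-tuned per grade: ${ft_cost:.5f}")
print(f"ratio: ~{gpt4_cost / ft_cost:.0f}x")  # on the order of 100x

# Implied training volume for the $1.41 run, assuming a training price
# of $0.008 per 1K tokens (an assumed rate, not stated in the notes):
training_tokens = 1.41 / 0.008 * 1000
print(f"~{training_tokens:,.0f} training tokens across ~1,200 examples")
```

With these placeholder numbers the per-grade ratio comes out above 100x, consistent with the "two orders of magnitude cheaper" observation: the saving is driven less by the cheaper model than by no longer paying for hints and prompt tokens on every call.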