Text cleaning is hard, but the text we have chosen to work with is pretty clean already. Use your task as the lens through which to decide how to prepare your text data. Consider two possible objectives we may have when working with this text document. If we were interested in classifying documents as “Kafka” and “Not Kafka,” we might want to strip case and punctuation, and even trim words back to their stems. If we were interested in developing a Kafkaesque language model, we might want to keep all of the case, quotes, and other punctuation in place. Either way, we are going to look at general text cleaning steps in this tutorial.

A few things stand out in the text itself. The section markers “II” and “III” are still present, and we have removed the first “I”. There do not appear to be any numbers that require handling. There’s a lot of use of the em dash (“-”) to continue sentences (maybe replace these with commas?). I’m sure there is a lot more going on to the trained eye.
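To make the task-as-lens idea concrete, here is a minimal sketch of how the two objectives above could lead to different cleaning functions. It uses only the Python standard library. The function names and the sample sentence are hypothetical and chosen purely for illustration, and stemming is left out because it would need a third-party stemmer.

```python
import string

# Hypothetical sample sentence, used only for illustration.
SAMPLE = 'The clerk paused - "Is it late?" - and looked up.'


def clean_for_classification(text):
    """Aggressive cleaning: lowercase, drop ASCII punctuation, split on whitespace.

    Note: string.punctuation only covers ASCII characters, so curly quotes
    or other Unicode punctuation would survive this step.
    """
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()


def clean_for_language_model(text):
    """Light cleaning: keep case and punctuation, only split on whitespace."""
    return text.split()


if __name__ == "__main__":
    print(clean_for_classification(SAMPLE))
    print(clean_for_language_model(SAMPLE))
```

The first function throws away information (case, quotes, dashes) that a “Kafka” versus “Not Kafka” classifier may not need, while the second keeps it for a model that has to reproduce the author’s style. Which one is right depends entirely on the task.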