In some languages bad words are generative, not fixed. That makes it pretty much impossible to compile a complete, fixed list of bad words.
In Nordland county in Norway, animal names, names of genitalia, and a few other words (often in the local dialect) are combined to produce a bad word. The fun thing is that what seems like a bad word can be taken as a boast in some situations. To confuse matters even more, some of the parts are common words for other things. A “peis” is a fireplace, but also a penis. Let's take an example. If you have asked a friend over to do some work and he doesn't show up, and you then call him a “måspeis” (“dick of a seagull”), he may just laugh it off: you called him a small jerk. If you call him a “hæstpeis” (“dick of a horse”), he might hit you: you called him a big jerk. Now assume you go to a party with your friend, you are asked at the door who he is, and you say he is a “måspeis”; then he might hit you. If you call him a “hæstpeis”, he might give you a beer.
To make this somewhat simpler, both “måspeis” and “hæstpeis” are informal words that should not show up in articles. To make it harder, they are not listed in any dictionary. To make it even harder, there are a lot of other combinations: “apskjit”, “mainskjit”, “hæstskjit”, “torskskjit”, “apekuk”, “mainkuk”, “hæstkuk”, “torsk-kuk”, “ap-peis”, “mainpeis”, “hæstpeis”, “torskpeis”, etc. With so many variations, each of them will most likely look like plain noise to a k-means algorithm.
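A minimal Python sketch of what "generative" means here, assuming the words are plain concatenations of a first part and a second part, optionally joined by a hyphen. The part lists are taken only from the examples above, and the names `FIRST_PARTS`, `SECOND_PARTS`, `generate_compounds` and `is_generated_badword` are made up for illustration. It is not the method described in the linked pages, and a match only marks a candidate, since the same word can be an insult or a boast depending on the situation.

```python
from itertools import product

# Parts taken from the examples above; a real list would be far longer.
FIRST_PARTS = ["ap", "ape", "main", "hæst", "torsk", "mås"]
SECOND_PARTS = ["skjit", "kuk", "peis"]


def generate_compounds(first_parts, second_parts):
    """Return every first+second combination, with and without a hyphen."""
    compounds = set()
    for first, second in product(first_parts, second_parts):
        compounds.add(first + second)
        compounds.add(first + "-" + second)
    return compounds


def is_generated_badword(word, first_parts=FIRST_PARTS, second_parts=SECOND_PARTS):
    """Check whether a word splits into a known first part plus a known second part."""
    normalized = word.lower().replace("-", "")
    for second in second_parts:
        if normalized.endswith(second):
            # Whatever precedes the second part must itself be a known first part.
            if normalized[: -len(second)] in first_parts:
                return True
    return False


if __name__ == "__main__":
    for candidate in ["måspeis", "torsk-kuk", "peis", "hest"]:
        print(candidate, is_generated_badword(candidate))
```

Splitting a word into known parts, instead of looking it up in a fixed list, means that a bare “peis” (the fireplace) is not flagged, while a combination that was never written down still is. The number of generated forms grows as the product of the two part lists, which is why a hand-made list keeps turning out to be incomplete.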
Some years ago I was pretty sure I had found all the combinations; the list held just short of 2000 bad words. Then I got a list from another source, and suddenly the list had over 7000 bad words, and it still was not complete.
I wrote a better description a few years back at m:Grants:IdeaLab/BadWords detector for AbuseFilter/Technical description. I also posted nearly the same text at m:Research talk:Revision scoring as a service/Word lists/no#Badwords.