October 07 Changes (HHH)
--------------------------

1. Action: Altered "natural" rule
   From:   BadHostParts[i++] = "natural";
   To:     BadHostParts[i++] = "natural[^il]";
   Reason: Thu Oct 4 19:23:52: www.naturallist.com
           Analysis revealed no "naturali" porn hosts and only two
           "naturall" porn hosts. Of the two, one was dead and the
           other actually had some images blocked.

2. Action: Altered "lady" rule
   From:   BadURL_WordEnds[i++]="lady";
   To:     BadHostWordEnds[i++]="lady";
   Reason: Here are the false positive URLs:
           Thu Oct 5 07:46:17: yourutahjob.com/images/main_lady_top.jpg
           Thu Oct 5 07:46:17: yourutahjob.com/images/main_lady_left_side.jpg
           Thu Oct 5 07:46:17: yourutahjob.com/images/main_lady_right_side.jpg
           I had already moved
              BadURL_WordStarts[i++]="lady";
           to
              BadHostWordStarts[i++]="lady";
           and called that change *** PARTIALLY RESOLVED ***. I was
           wrong. BOTH of these rules ARE NOW resolved.

3. Action: Altered "nude" rule
   From:   BadURL_Parts[i++] = "nude";
   To:     BadHostParts[i++] = "nude";
   Reason: Thu Oct 5 08:35:08 www.weber.k12.ut.us/themes/wsd/images/menudepartments.gif
           I can't always prevent runTogetherWords like this, but we
           still have the following URL rules:
              BadURL_WordStarts[i++]="nude";
              BadURL_WordEnds[i++]="nude";
           Using the Start and End rules, but eliminating the Parts
           rule, we have:
               5888 nude_Parts.txt
               4128 nude_Starts_and_Ends.txt
                684 nude_Passed_All_Rules.txt
              -------------------------------
              10700 total
           I checked patterns of 1, 2, 3, and 4 characters before
           "nude" and came up with only one neutral word, "denude".
           For 1 character after we have "nuder" and "nudes". For 2
           characters after we have "nudely" and "nudest". If you
           replace the "e" with an "i" then we would also have
           "nudist". There are no words with 3 characters after, and
           only one with 4, "nudeness". All have the root "nude", with
           the trailing characters acting as modifiers.
           I looked at the nude_Passed_All_Rules.txt file and could
           deduce (induce?) no general rules. Here are the patterns,
           as LETTER: precede-count, follow-count.
              a:  37,   7     b:   2,  26     c:  15,  23     d:  27,   6
              e: 103,   1     f:  12,   3     g:  13,  40     h:  23,   2
              i:   7,   3     j:   1,   0     k:  19,   0     l:  66,   5
              m:   4,  18     n:  65,   9     o:  22,   2     p:   4,  86
              q:   0,   0     r:  33,   4     s:  60, 365     t:  78,   8
              u:   4,   4     v:   0,   1     w:  10,  58     x:   4,   0
              y:  73,   2     z:   4,  11
           That means we have no:
              [ q | v ]nude   or   nude[ j | k | q | x ]
           Further, the one neutral word I can find, "denude", falls
           in with 103 host names that have "nude" preceded by "e".
           NUFF!

4. Action: Revised some "nudi" rules
   From:   BadURL_Parts[i++] = "nudist";
           BadURL_Parts[i++] = "nudity";
   To:     BadURL_Parts[i++] = "nudis";
           BadURL_Parts[i++] = "nudit";
   Reason: "nudis" adds 28 hosts, all "nudism". "nudit" adds 16 hosts,
           which seem to be evenly split between Francais "nudite" and
           Espanol "nudita". There is no telling how many URLs that
           adds.

5. Action: Added "nudie" rule
   Added:  BadURL_Parts[i++] = "nudie";
   Reason: Without it, 29 host names will get away, and who knows how
           many URLs. I pulled "nudie" out of the leftovers from the
           previous change and there were 29 of them. Until we have
           false positives, I have no way of knowing whether I have
           pushed it too far. Only time will tell. "Nudie" is just a
           cutesy English way of saying it.
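
For anyone who wants to repeat the precede-count / follow-count tally from
item 3 on another pattern, here is a minimal sketch of one way it could be
done. It is not the script that produced the numbers above. The function
name tallyNeighbors and the sample host names are made up for illustration,
and it assumes the host list is already loaded as an array of strings.

```javascript
// Hypothetical helper (not part of the rule file): for every occurrence of
// `pattern` in each host name, count the letter immediately before it and
// the letter immediately after it. Non-letters (dots, digits, hyphens) and
// string boundaries are ignored.
function tallyNeighbors(hosts, pattern) {
  const before = {};
  const after = {};
  for (const host of hosts) {
    const name = host.toLowerCase();
    let at = name.indexOf(pattern);
    while (at !== -1) {
      const prev = name[at - 1];               // undefined at start of string
      const next = name[at + pattern.length];  // undefined at end of string
      if (prev !== undefined && /[a-z]/.test(prev)) {
        before[prev] = (before[prev] || 0) + 1;
      }
      if (next !== undefined && /[a-z]/.test(next)) {
        after[next] = (after[next] || 0) + 1;
      }
      at = name.indexOf(pattern, at + 1);      // look for further occurrences
    }
  }
  return { before, after };
}

// Made-up sample hosts, for illustration only.
const sampleHosts = [
  "www.denude.example",
  "www.menudepartments.example",
  "nudes.example",
  "www.nudist.example",
];

const tally = tallyNeighbors(sampleHosts, "nude");
console.log("precede:", tally.before);
console.log("follow: ", tally.after);
```

On the four sample hosts, "e" precedes "nude" twice, and "p" and "s" each
follow it once. "www.nudist.example" contributes nothing because it does
not contain "nude". A letter with a zero count on a real host list is a
candidate for a narrower rule like the "natural[^il]" one in item 1.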