Spam cleaning thoughts
----------------------

JMF 15 Jan 2006
---------------

Need signature and clean regexps... options to clean automatically
or offer for human check.

For each page, get the contents of the Edit form
  check it for a match with any signature
  if match, select the offending part, allow human check.

Should use TDD - need JavaScript unit test framework.

Should do some analysis of the problem - build database of
page name, updated by, IP address, date-time, versions, size.
TDD of this, too! Or at least, build from parts tried in IRB.

Two key aspects:
- find all page URLs
- for given page, extract required data

Relevant snippets - from list all http://wiki.rubyonrails.com/rails/pages
  • Dawnthorn
  • Alastair Moore
and from a specific page:
    Created on January 13, 2006 20:25 by Dawnthorn (24.7.72.59)
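
Rough sketch of pulling author, IP and date-time out of a byline like the
one above, something to try in IRB. Assumptions, none confirmed by the wiki
itself: the tags have already been stripped so the byline is plain text,
"Revised" may appear in place of "Created", and the names BYLINE and
parse_byline are made up for illustration.

```ruby
require 'date'

# Hypothetical pattern for the byline text. Every gap is \s+ so the parts
# may sit on separate lines; the m flag also lets "." cross a line break.
BYLINE = /
  (Created|Revised)\s+on\s+                    # "Revised" is an assumed variant
  (\w+\s+\d{1,2},\s+\d{4}\s+\d{1,2}:\d{2})\s+  # e.g. January 13, 2006 20:25
  by\s+(.+?)\s*                                # author name, may contain spaces
  \((\d{1,3}(?:\.\d{1,3}){3})\)                # IP address in parentheses
/mx

# Returns a hash of byline fields, or nil when no byline is found.
def parse_byline(body)
  m = BYLINE.match(body)
  return nil unless m

  stamp = m[2].split.join(' ') # collapse line breaks and repeated spaces
  {
    action: m[1],
    at:     DateTime.strptime(stamp, '%B %d, %Y %H:%M'),
    author: m[3],
    ip:     m[4]
  }
end

info = parse_byline('Created on January 13, 2006 20:25 by Dawnthorn (24.7.72.59)')
info[:author] # => "Dawnthorn"
info[:ip]     # => "24.7.72.59"
```

The timestamp is parsed without a time zone, since the byline does not
state one. Returning nil for a page with no byline lets the caller skip it
or log it.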
Note: Home Page is different - /rails - ignore it for now.

To find the byline, chop body into lines, iterate back from end to find
this, then extract data from the following three lines.

JMF 16th January 2006
---------------------

Used multiline regex to get the byline content - less code!

TODO:
1. Get number of versions. Think about using this to retrieve metadata
   for all versions. Then can see the whole pattern of use, rather than
   just the "leading edge". Also could then go to incremental update of
   this history, driven by Recent Changes.
2. Look for spam signatures in displayed pages.
3. Get markup for analysis - this is what will have to be corrected.
4. Investigate concurrency control, or lack of it.

JMF 17th January 2006
---------------------

What proportion of pages are logically empty?

Create a clean read-only mirror?

There is 13.2MB of HTML in the current pages. How much in the markup?
And in the whole history?

Versions - can see how many previous versions there are here:

Pattern: