kmp algorithm,

Detecting Plagiarized Property Descriptions with the KMP Algorithm

Detecting Plagiarized Property Descriptions with the KMP Algorithm

Zillow wants to keep its platform trustworthy by ensuring that all property descriptions are unique and high-quality. Duplicate or copied descriptions can harm user experience and lower trust in the platform. To prevent this, Zillow can use the KMP (Knuth-Morris-Pratt) algorithm to find and stop plagiarism in property descriptions. This algorithm helps quickly compare new descriptions with existing ones, making it easier to spot any duplicates. By using this method, Zillow can ensure that the content on their site remains original and reliable, enhancing user trust and overall platform quality.

KMP (Knuth-Morris-Pratt) Algorithm

The KMP algorithm is efficient for pattern matching and can quickly locate substrings within a larger string. It preprocesses the pattern to create a partial match table, which is then used to skip unnecessary comparisons during the search.

Implementation

  1. Preprocess Existing Descriptions
    • Build a database of existing property descriptions.
    • Preprocess these descriptions using the KMP algorithm to create a partial match table.
  2. Scan New Descriptions
    • For each new property description, use the KMP algorithm to compare it against the database.
    • Apply KMP for exact substring matching.
  3. Flag Potential Plagiarism
    • If a match or close match is found, flag the description for further review.
  4. Review and Take Action
    • Review flagged descriptions manually to confirm plagiarism.
    • Take appropriate action, such as rejecting the description or asking the user to revise it.

Algorithm

algorithm kmp_search:
    input:
        an array of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an array of integers, P (positions in S at which W is found)
        an integer, nP (number of positions)

    define variables:
        an integer, j  0 (the position of the current character in S)
        an integer, k  0 (the position of the current character in W)
        an array of integers, T (the table, computed elsewhere)

    let nP  0

    while j < length(S) do
        if W[k] = S[j] then
            let j  j + 1
            let k  k + 1
            if k = length(W) then
                (occurrence found, if only first occurrence is needed, m  j - k  may be returned here)
                let P[nP]  j - k, nP  nP + 1
                let k  T[k] (T[length(W)] can't be -1)
        else
            let k  T[k]
            if k < 0 then
                let j  j + 1
                let k  k + 1

Time and Space Complexity

  • Time Complexity: O(M + N ) M:substring length N:Text length
  • Space Complexity: O(M)

Here is the full code implementation Code