Leetcode 30 – Substring with concatenation of all words

Source: https://leetcode.com/problems/substring-with-concatenation-of-all-words/

Problem statement

You are given a string s and an array of strings words. All the strings of words are of the same length.

A concatenated substring in s is a substring that contains all the strings of any permutation of words concatenated.

For example, if words = ["ab","cd","ef"], then "abcdef", "abefcd", "cdabef", "cdefab", "efabcd", and "efcdab" are all concatenated strings. "acdbef" is not a concatenated substring because it is not the concatenation of any permutation of words.

Return the starting indices of all the concatenated substrings in s. You can return the answer in any order.

Example 1:

Input: s = "barfoothefoobarman", words = ["foo","bar"]
Output: [0,9]
Explanation: Since words.length == 2 and words[i].length == 3, the concatenated substring has to be of length 6.
The substring starting at 0 is "barfoo". It is the concatenation of ["bar","foo"] which is a permutation of words.
The substring starting at 9 is "foobar". It is the concatenation of ["foo","bar"] which is a permutation of words.
The output order does not matter. Returning [9,0] is fine too.

Example 2:

Input: s = "wordgoodgoodgoodbestword", words = ["word","good","best","word"]
Output: []
Explanation: Since words.length == 4 and words[i].length == 4, the concatenated substring has to be of length 16.
There is no substring of length 16 in s that is equal to the concatenation of any permutation of words.
We return an empty array.

Example 3:

Input: s = "barfoofoobarthefoobarman", words = ["bar","foo","the"]
Output: [6,9,12]
Explanation: Since words.length == 3 and words[i].length == 3, the concatenated substring has to be of length 9.
The substring starting at 6 is "foobarthe". It is the concatenation of ["foo","bar","the"] which is a permutation of words.
The substring starting at 9 is "barthefoo". It is the concatenation of ["bar","the","foo"] which is a permutation of words.
The substring starting at 12 is "thefoobar". It is the concatenation of ["the","foo","bar"] which is a permutation of words.

Constraints:

1 <= s.length <= 10⁴
1 <= words.length <= 5000
1 <= words[i].length <= 30
s and words[i] consist of lowercase English letters.

Solution

For any starting index i we know exactly the total length of the substring we expect (sum of all of the word lengths). The challenge is to figure out the order of the words in that substring.

One approach is to move letter by letter and reject all the words which don’t match the prefix we have scanned so far. If at any point the prefix matches a complete word from our list, we mark that word as used, clear the prefix and continue with the process until the end. After the whole substring has been scanned, we mark the current starting index as one of the solutions only if we used every word exactly as many times as it appears in the list (the list may contain duplicates). We use a map from word to count to maintain a histogram of words. This can be optimized a bit further by abandoning the current starting index early, as soon as the word we just counted has been used more times than it appears in the list.
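To make the histogram check concrete, here is a minimal sketch of testing a single starting index. It is not the trie-based scan described above: it leans on the constraint that all words have the same length and simply cuts the window into fixed-size chunks. The helper name is_concatenation_at and the use of std::unordered_map are assumptions made for illustration.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper (not part of the original solution): returns true when
// the window of s starting at `start` is a concatenation of some permutation
// of words. Assumes words is non-empty and all words share the same length.
bool is_concatenation_at(const std::string& s, std::size_t start,
                         const std::vector<std::string>& words) {
    const std::size_t len = words[0].size();
    const std::size_t total = len * words.size();
    if (start > s.size() || s.size() - start < total) return false;  // window does not fit

    // histogram of the words we are allowed to use
    std::unordered_map<std::string, int> allowed;
    for (const std::string& w : words) ++allowed[w];

    // histogram of the words seen so far in this window
    std::unordered_map<std::string, int> used;
    for (std::size_t k = 0; k < words.size(); ++k) {
        const std::string chunk = s.substr(start + k * len, len);
        auto it = allowed.find(chunk);
        if (it == allowed.end()) return false;         // chunk is not a listed word
        if (++used[chunk] > it->second) return false;  // early exit: word overused
    }
    // words.size() chunks were consumed and none was overused,
    // so every word was used exactly as often as it is listed
    return true;
}
```

Calling this helper for every start index is a handy brute-force reference to compare a faster solution against. Note that the early exit alone is enough to guarantee correctness: the window holds exactly words.size() chunks, so if none is overused, none can be missing.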

Now, in order to quickly traverse the list based on an incrementally built prefix we use a very simple implementation of a prefix tree (trie). Each node implicitly represents a single lowercase letter of the English alphabet. Pointers to 26 child nodes represent the next letter in the word. Each node’s boolean member is_end marks the letter as the final letter of a word in the list. This is used to stop the traversal once we hit a node with is_end=true. Finding a word of length L in the tree takes O(L) steps, regardless of the number of words N in the list, which is great.

Prefix tree example for a word list: [ “cat”, “catapult”, “category”, “game” ]. Red nodes represent end of the word.
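The O(L) lookup on the example word list can be sketched as follows. This is an illustration only, assuming lowercase words; the names TrieNode, trie_insert and trie_contains are hypothetical, and std::unique_ptr is used here just to keep the sketch free of manual memory management.

```cpp
#include <array>
#include <memory>
#include <string>

// Illustrative trie node; not part of the original solution.
struct TrieNode {
    bool is_end = false;                             // true if a word ends at this node
    std::array<std::unique_ptr<TrieNode>, 26> next;  // children for 'a'..'z', null by default
};

// Insert a lowercase word, creating missing nodes along the path.
void trie_insert(TrieNode& root, const std::string& word) {
    TrieNode* current = &root;
    for (char ch : word) {
        const int c = ch - 'a';
        if (!current->next[c]) current->next[c] = std::make_unique<TrieNode>();
        current = current->next[c].get();
    }
    current->is_end = true;
}

// Look up a lowercase word in O(L) steps, one node per letter.
bool trie_contains(const TrieNode& root, const std::string& word) {
    const TrieNode* current = &root;
    for (char ch : word) {
        const int c = ch - 'a';
        if (!current->next[c]) return false;  // no continuation with this letter
        current = current->next[c].get();
    }
    return current->is_end;  // a path alone is not enough: "cata" is only a prefix
}
```

With the four example words inserted, "cat" and "game" are found, while "cata" and "gam" are not, because their last nodes are not marked as word ends. In this example "cat" is both a word and a prefix of longer words; in the problem itself all words have the same length, so a node with is_end=true never has children.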

The complete C++ implementation is given below.

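#include <map>
#include <string>
#include <vector>
using namespace std;

class PrefixTree {
public:
    bool is_end = false;
    PrefixTree* next[26] = {}; // all children start as null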
    
    // insert the word into the tree
    void insert(string word) {
        // start at the root
        PrefixTree *current = this;
        
        for (int i=0;i<word.size();i++) {
            char c = word[i]-'a';
            if (current->next[c] == NULL) {
                current->next[c] = new PrefixTree();
            }
            current = current->next[c];
        }
        
        current->is_end = true;
    }
    
};

class Solution {
public:
    
    vector<int> findSubstring(string s, vector<string>& words) {
        vector<int> res;
        
        int n = words.size();
        int total_n = 0;
        for (int i=0;i<n;i++) total_n += words[i].size();
        
        // construct the prefix tree
        // and word histogram
        PrefixTree *t = new PrefixTree();
        map<string, int> w_h;
        for (int i=0;i<n;i++) {
            t->insert(words[i]);
            w_h[words[i]] = w_h[words[i]] + 1;
        }
        
        // traverse the string trying to start from each letter;
        // the signed comparison avoids unsigned underflow when s is shorter than total_n
        for (int i=0;i+total_n<=(int)s.size();i++) {
            // move through the total number of letters concatenated
            PrefixTree *t_current = t;
            bool stop = false;
            string w;
            map<string, int> w_h_current;
            int j;
            for (j=0;j<total_n && !stop;j++) {
                char c = s[i+j]-'a';
                // current letter doesn't exist as a continuation of the word
                if (t_current->next[c] == NULL) {
                    // no child: either a complete word just ended here, or this is a dead end
                    if (t_current->is_end) {
                        // start the new word
                        t_current = t; // reset the tree to the root
                        w_h_current[w] = w_h_current[w] + 1;
                        
                        // if we used the word more times than allowed...
                        if (w_h_current[w] > w_h[w]) { stop = true; continue; }
                        w = "";
                        j--;
                        continue;
                    } else {
                        stop = true;
                    }
                } else {
                    // move into the next char
                    w.push_back(c + 'a');
                    t_current = t_current->next[c];
                }
            }
            
            // handle the last word
            if (j == total_n) {
                if (t_current->is_end) w_h_current[w] = w_h_current[w] + 1;
                if (w_h_current[w] > w_h[w]) stop = true;
            }
            
            if (stop) continue;
            res.push_back(i);
        }
        
        return res;
    }
};

September 9, 2022 igorperic

Igor Perić