Similarity search on Stack Overflow posts

Stack Overflow posts


In [6]:
%%sql -d standard
SELECT 
  id, title, body, view_count, tags
FROM
  `gcp-samples2.stackoverflow_demo.top100K_posts`
ORDER BY
  view_count DESC
LIMIT
  5


Out[6]:
idtitlebodyview_counttags
927358How to undo last commit(s) in Git?<p>I committed the wrong files to <strong>Git</strong>.</p> <p>How can I undo that commit?</p>4077668git|git-rebase|git-commit|git-reset|git-revert
5585779Converting String to Int in Java?<p>How can I convert a <code>String</code> to an <code>int</code> in Java?</p> <p>My String contains only numbers and I want to return the number it represents.</p> <p>For example, given the string <code>"1234"</code> the result should be the number <code>1234</code>.</p>3825994java|string|type-conversion
1789945How to check if one string contains another substring in JavaScript?<p>Usually, I would expect a <code>String.contains()</code> method, but there doesn't seem to be one. What is a reasonable way to check for this?</p>3427502javascript|string|substring|string-matching
2003505How to delete a Git branch both locally and remotely?<p>I want to delete a branch both locally and on my remote project fork on <a href="http://en.wikipedia.org/wiki/GitHub" rel="nofollow noreferrer">GitHub</a>.</p> <h3>Failed Attempts to Delete Remote Branch</h3> <pre><code>$ git branch -d remotes/origin/bugfix error: branch 'remotes/origin/bugfix' not found. $ git branch -d origin/bugfix error: branch 'origin/bugfix' not found. $ git branch -rd origin/bugfix Deleted remote branch origin/bugfix (was 2a14ef7). $ git push Everything up-to-date $ git pull From github.com:gituser/gitproject * [new branch] bugfix -&gt; origin/bugfix Already up-to-date. </code></pre> <p>What do I need to do differently to successfully delete the <code>remotes/origin/bugfix</code> branch both locally and on GitHub?</p>3416408git|git-branch|git-remote
503093How do I redirect to another page in jQuery?<p>How can I redirect the user from one page to another using jQuery?</p>3382579javascript|jquery|redirect

(rows: 5, time: 2.9s, 101MB processed, job: job_JSvvfpcgJf536AYJv0dtpy3Hqh6J)

Segmentation


In [7]:
%%sql -d standard
CREATE TEMPORARY FUNCTION segmentation(body STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """

// remove HTML tags, convert to lowercase and extract words
return body.replace(/(<[^>]+>|&#[^;]+;)/g, '').toLowerCase().match(/\\w\\w+/g);

""";

SELECT
  id, 
  segmentation(CONCAT(title, body, tags)) as words
FROM
  `gcp-samples2.stackoverflow_demo.top100K_posts`
LIMIT
  5

# (for full query on 10M posts takes about 30 secs with 14.1GB)


Out[7]:
idwords
927358['how', 'to', 'undo', 'last', 'commit', 'in', 'git', 'committed', 'the', 'wrong', 'files', 'to', 'git', 'how', 'can', 'undo', 'that', 'commit', 'git', 'git', 'rebase', 'git', 'commit', 'git', 'reset', 'git', 'revert']
5585779['converting', 'string', 'to', 'int', 'in', 'java', 'how', 'can', 'convert', 'string', 'to', 'an', 'int', 'in', 'java', 'my', 'string', 'contains', 'only', 'numbers', 'and', 'want', 'to', 'return', 'the', 'number', 'it', 'represents', 'for', 'example', 'given', 'the', 'string', '1234', 'the', 'result', 'should', 'be', 'the', 'number', '1234', 'java', 'string', 'type', 'conversion']
1789945['how', 'to', 'check', 'if', 'one', 'string', 'contains', 'another', 'substring', 'in', 'javascript', 'usually', 'would', 'expect', 'string', 'contains', 'method', 'but', 'there', 'doesn', 'seem', 'to', 'be', 'one', 'what', 'is', 'reasonable', 'way', 'to', 'check', 'for', 'this', 'javascript', 'string', 'substring', 'string', 'matching']
2003505['how', 'to', 'delete', 'git', 'branch', 'both', 'locally', 'and', 'remotely', 'want', 'to', 'delete', 'branch', 'both', 'locally', 'and', 'on', 'my', 'remote', 'project', 'fork', 'on', 'github', 'failed', 'attempts', 'to', 'delete', 'remote', 'branch', 'git', 'branch', 'remotes', 'origin', 'bugfix', 'error', 'branch', 'remotes', 'origin', 'bugfix', 'not', 'found', 'git', 'branch', 'origin', 'bugfix', 'error', 'branch', 'origin', 'bugfix', 'not', 'found', 'git', 'branch', 'rd', 'origin', 'bugfix', 'deleted', 'remote', 'branch', 'origin', 'bugfix', 'was', '2a14ef7', 'git', 'push', 'everything', 'up', 'to', 'date', 'git', 'pull', 'from', 'github', 'com', 'gituser', 'gitproject', 'new', 'branch', 'bugfix', 'gt', 'origin', 'bugfix', 'already', 'up', 'to', 'date', 'what', 'do', 'need', 'to', 'do', 'differently', 'to', 'successfully', 'delete', 'the', 'remotes', 'origin', 'bugfix', 'branch', 'both', 'locally', 'and', 'on', 'github', 'git', 'git', 'branch', 'git', 'remote']
503093['how', 'do', 'redirect', 'to', 'another', 'page', 'in', 'jquery', 'how', 'can', 'redirect', 'the', 'user', 'from', 'one', 'page', 'to', 'another', 'using', 'jquery', 'javascript', 'jquery', 'redirect']

(rows: 5, time: 3.8s, 100MB processed, job: job_ZXi_Gt46RrARsG9PdHzKwIzJKw0h)

Calculate Feature Vectors (TF-IDF) for each post


In [8]:
%%sql -d standard
CREATE TEMPORARY FUNCTION calc_tf_idf(words ARRAY<STRING>)
RETURNS STRING
LANGUAGE js AS """

// count each word in this post
var word_count = new Object();
for (word of words) {
  if (word_count[word]) {
    word_count[word]++;
  } else {
    word_count[word] = 1;
  }
}

// calculate TF-IDF values for each word
// tf = word count / total word count
// idf = log (100K posts / # of posts with the word)
// tf_idf = tf * idf
var total_posts = 100000;
var words_in_post = words.length;
var tf_idf = new Object();
var norm_sum = 0;
for (word in word_count) {
  if (word_dict[word] == null) {
    word_dict[word] = 1;
  }
  var tf = (word_count[word] / words_in_post);
  var idf = Math.log(total_posts / word_dict[word]);
  tf_idf[word] = tf * idf;
  norm_sum += tf_idf[word]^2;
}

// normarizing TF-IDF values with L2 norm
for (word in tf_idf) {
  tf_idf[word] = (tf_idf[word] / Math.sqrt(norm_sum)).toFixed(5);
  if (tf_idf[word] < 0.01) {
    delete tf_idf[word]; // remove trivial words
  }
}

return JSON.stringify(tf_idf);

"""
OPTIONS (
  library="gs://gcp-samples2-stackoverflow/word_dict_0.js",
  library="gs://gcp-samples2-stackoverflow/word_dict_1.js",
  library="gs://gcp-samples2-stackoverflow/word_dict_2.js",
  library="gs://gcp-samples2-stackoverflow/word_dict_3.js",
  library="gs://gcp-samples2-stackoverflow/word_dict_4.js",
  library="gs://gcp-samples2-stackoverflow/word_dict_5.js",
  library="gs://gcp-samples2-stackoverflow/word_dict_6.js"
);

SELECT
  id, calc_tf_idf(words) AS tf_idf
FROM
  `gcp-samples2.stackoverflow_demo.top100K_posts_segmented` AS posts
LIMIT
  10

# (for full query on 10M posts takes 50 secs with 12.2GB)


Out[8]:
idtf_idf
953918{"align":"0.01234","middle":"0.02801","horizontally":"0.01613","width":"0.02588","pagei":"0.01700","div":"0.01583","tag":"0.01047","800px":"0.03974","greater":"0.01551","shouldn":"0.01415","stretch":"0.01731","bring":"0.01650","alignment":"0.01820","center":"0.01136"}
826782{"highlighting":"0.01953","css":"0.01210","textselection":"0.01131"}
7172784{"deneme":"0.01015"}
1135245{"list":"0.01981","mysql":"0.05330","user":"0.01815","accountsi":"0.04435","command":"0.01096","utility":"0.02158","navigate":"0.02199","through":"0.01165","database":"0.01131","accounts":"0.02371","version":"0.01087","mysql5":"0.03634"}
40471{"differences":"0.06406","between":"0.03702","hashmap":"0.11006","hashtable":"0.13619","are":"0.01037","java":"0.02999","which":"0.01079","more":"0.01644","efficient":"0.03149","non":"0.02561","threaded":"0.04579","applications":"0.02915"}
689746{"column":"0.01140","null":"0.01373","nulls":"0.02555","sql":"0.01003","alter":"0.01886"}
27509{"detecting":"0.05480","an":"0.01953","undefined":"0.11835","object":"0.06718","propertywhat":"0.09054","best":"0.02644","way":"0.01359","checking":"0.03959","property":"0.03064","javascript":"0.04559"}
2275004{"java":"0.02386","do":"0.01005","check":"0.03054","if":"0.01164","string":"0.01930","contains":"0.01766","substring":"0.05549","ignoring":"0.06329","case":"0.03167","two":"0.01324","strings":"0.02059","str1":"0.07470","str2":"0.07486","contained":"0.02893","within":"0.01740"}
507138{"add":"0.01687","class":"0.01931","element":"0.01551","lt":"0.01027","div":"0.02112","someclass":"0.01787","img":"0.01029","image1":"0.03617","javascript":"0.01260","dom":"0.01227","manipulation":"0.01619"}
415953{"generate":"0.06685","md5":"0.15156","hash":"0.12230","there":"0.01039","any":"0.01159","method":"0.01888","string":"0.01521","java":"0.03759","hashcode":"0.05860"}

(rows: 10, time: 10.5s, 89MB processed, job: job_t8Zqfc0krhckGJISOhivaFBigY6w)

Similarity Search with Feature Vectors


In [9]:
%%sql -d standard
CREATE TEMPORARY FUNCTION calc_similarity(tf_idf_json_0 STRING, tf_idf_json_1 STRING)
RETURNS FLOAT64
LANGUAGE js AS """

// parse JSON to extract tf_idf
var tf_idf_0 = JSON.parse(tf_idf_json_0);
var tf_idf_1 = JSON.parse(tf_idf_json_1);

// calculate cosine similarity
var similarity = 0;
for (word in tf_idf_0) {
  var t0 = tf_idf_0[word] ? Number(tf_idf_0[word]) : 0;
  var t1 = tf_idf_1[word] ? Number(tf_idf_1[word]) : 0;
  similarity += t0 * t1;
}

return similarity;
""";

SELECT
  title,
  body,
  tags,
  similarity
FROM
  (
    SELECT
      t1.id, 
      calc_similarity(tf_idf_0, t1.tf_idf) AS similarity
    FROM
      (
        SELECT tf_idf AS tf_idf_0
        FROM `gcp-samples2.stackoverflow_demo.top100K_posts_tf_idf` AS t0
        WHERE id = 92082 
      )
    CROSS JOIN
      `gcp-samples2.stackoverflow_demo.top100K_posts_tf_idf` AS t1
    ORDER BY
      similarity DESC
    LIMIT
      10
  )
JOIN
  `gcp-samples2.stackoverflow_demo.top100K_posts` AS t2
USING (id)  
ORDER BY
  similarity DESC

# (for full query on 10M posts takes about 16 secs with 16 GB)

# try other posts : 
# 5585779 (String to Int in Java)
# 1789945 (substring in JS)
# 92082 (add a column in SQL Server)
# 503093 (redirecting in jQuery)


Out[9]:
titlebodytagssimilarity
Deleting columns in MS SQL Server<p>How do I delete a column from an existing table?</p>sql-server|sql-server-20050.0160030167
Add a column with a default value to an existing table in SQL Server<p>How can a column with a default value be added to an existing table in <a href="http://en.wikipedia.org/wiki/Microsoft_SQL_Server#Genesis" rel="nofollow noreferrer">SQL Server 2000</a> / <a href="http://en.wikipedia.org/wiki/Microsoft_SQL_Server#SQL_Server_2005" rel="nofollow noreferrer">SQL Server 2005</a>?</p>sql|sql-server0.0148238179
How is data stored in SQL server?<p>How is data stored in SQL server?</p>sql|sql-server-20080.0145842942
Create a one to many relationship using SQL Server<p>How do you create a one to many relationship using SQL Server?</p>sql|sql-server|sql-server-20050.0144546887
Set a existing column of MS SQL table as NOT NULL<p>How to Set a existing column of MS SQL table as NOT NULL? </p>sql-server|sql-server-20050.0130666625
Add not null Column in existing table in SQL<p>How to add not null Column in existing table in SQL Server 2005?</p>sql-server0.0128557815
Trunc(sysdate) in SQL Server<p>What is the equivalent of:</p> <pre><code>TRUNC(SYSDATE) </code></pre> <p>...in SQL Server 2005?</p>sql|sql-server|sql-server-2005|oracle|tsql0.0128514049
SQL update undo<p>Is there a way we can undo a SQL update query?</p>sql|sql-update|undo0.0127105902
Convert a string to int using sql query<p>How to convert a string to integer using SQL query on SQL Server 2005?</p>sql|sql-server-20050.0126629017
How do I check if a SQL Server datetime column is empty?<p>How do I check if a SQL Server datetime column is empty?</p>sql|sql-server|datetime0.0117317336

(rows: 10, time: 5.2s, 116MB processed, job: job_QvT2egKJWRTk1yVHkqKLpYoVH1oM)

In [ ]: