Search engines are everywhere nowadays. Google is the most popular web search engine. Facebook has a search engine that lets you search for people, places, posts, and other content you are interested in. Almost every website has a way to help you quickly find the specific information you are looking for, usually powered by an underlying search engine.
The general idea of a search engine is very simple. You have a repository of documents, like web pages or Facebook posts. You build a software system on top of this repository that answers keyword queries, like “X Y Z”, by returning a sorted set of documents that contain some or all of the keywords in the query.
The algorithm underneath a search engine looks very straightforward as well: the engine can simply go through all the documents and find those that contain the keywords being searched.
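This naive approach can be sketched in a few lines of Python. The document texts and the query here are made-up examples, and the matching is deliberately simplistic (whitespace tokenization, exact word match):

```python
def naive_search(documents, query):
    """Scan every document; return ids of those containing all query keywords."""
    keywords = query.lower().split()
    results = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(kw in words for kw in keywords):
            results.append(doc_id)
    return results

documents = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick brown dog",
}
print(naive_search(documents, "brown dog"))  # → [2, 3]
```

Note that every query touches every document, which is exactly why this breaks down at scale.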
This idea works for a small repository. However, it won’t work at all for a huge one, say a billion-document repository, because it could take forever for your software to go through every document before returning the results of a query.
This is a common phenomenon in computer engineering. A problem can be solved one way at a small scale, say a search over a few thousand documents. But when you try to extend it to a much larger scale, say billions of documents, the problem suddenly looks very different.
An example is the Facebook website, which now has almost 1.5 billion monthly active users and is completely different from when it had only tens of thousands of users. Software engineers love solving problems at scale, which is one of the biggest challenges in the field. There is even an annual industry conference, @scale, dedicated to discussing how to solve problems at scale in computer engineering.
Back to building a search engine at scale. The smart idea is to first build an index, or an inverted index as it is called in technical terms. This lets the search engine quickly retrieve all documents that contain the keywords in a query instead of going through every document on demand. An inverted index is very similar to the index of a thick book: it tells you which documents contain a specific keyword. The main difference is that an inverted index for a search engine is usually so huge that it can only be built by computers, and it may even take thousands of computers to build.
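A minimal sketch of the idea, again with made-up toy documents: build the index once, mapping each word to the set of documents containing it, then answer queries by intersecting those sets instead of scanning everything.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Intersect the posting sets of the keywords to answer the query."""
    keywords = query.lower().split()
    postings = [index.get(kw, set()) for kw in keywords]
    return set.intersection(*postings) if postings else set()

documents = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick brown dog",
}
index = build_index(documents)
print(sorted(search(index, "brown dog")))  # → [2, 3]
```

The expensive scan now happens once at index-build time; each query only looks up a handful of posting sets, which is what makes the approach viable for huge repositories.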
So you can build a search engine for a huge repository using an inverted index, which takes a large amount of computing power to build. The index can give you all documents that contain the keywords of a query.
However, that alone is far from a useful search engine. A huge part you are missing is document ranking. It’s very likely the inverted index gives you tens of thousands or even millions of documents for a query, because queries are usually short and documents are relatively much longer. It’s useless for a human to get so many documents; finding what you want in them is like searching for a needle in a haystack. Imagine if you had to look at the first 100 pages of Google’s search results before you found the web page you were looking for. Instead, Google sorts all results according to some complicated algorithms so that people usually find what they want on the first search result page. That’s why Google is so popular and useful.
Document ranking is another big and challenging area of search engines. People have invented tons of smart algorithms to rank documents using many collected signals extracted from the documents, the queries, and even the people who submit those queries.
Overall, two general ideas lie behind a search engine: the inverted index and document ranking. Both involve many techniques and have many subareas that attract enthusiastic engineers and researchers, who put endless effort into improving them, little by little.