Interview

Interviewee

Wenchao Zhou
Associate Professor
Department of Computer Science

His research interests center on the use of data-centric and formal techniques towards ensuring safe and secure distributed systems. He is also interested in distributed data management and the application of database technologies to networked systems.

Script

I’m Wenchao Zhou. I’m an associate professor in the Department of Computer Science. My research interests are in databases, networking, and computer systems.

So Weibo trending, right.

Essentially, Weibo trending is trying to reflect the most popular search terms on Weibo, or on any other platform that has this type of feature. So if you just think about it, it’s actually a very simple thing. You have so many searches at the same time, and you just want to find which term or phrase gets the most hits.

I mean within a period of time, right. So if you just look at it that way, it’s not difficult to calculate. You just take all the data and find the most frequent items. That’s it.
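The simple version described here, counting hits within a period of time, can be sketched in a few lines. This is only an illustration under stated assumptions: the (timestamp, term) record layout, the function name top_terms, and the sample data are all hypothetical and not taken from Weibo.

```python
from collections import Counter

def top_terms(searches, start, end, k=3):
    """Return the k most frequent terms among (timestamp, term) pairs
    whose timestamp falls in the window [start, end)."""
    counts = Counter(term for ts, term in searches if start <= ts < end)
    return counts.most_common(k)

# Hypothetical sample data: (seconds since some origin, search term).
searches = [
    (0, "wedding"), (5, "football"), (12, "wedding"),
    (30, "weather"), (45, "wedding"), (50, "football"), (75, "weather"),
]

# Most frequent terms in the first minute.
top = top_terms(searches, start=0, end=60, k=2)
print(top)
```

On one machine with a small log this is all there is to it; the rest of the discussion is about what changes when the volume no longer fits that picture.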

The difficulty comes when you actually scale to, like, Weibo’s scale. You literally have millions of searches in any given hour. So the most difficult part is how we can do it efficiently.

There are two challenges in implementing this feature. One is how you find what the phrases are. I mean, people search for the same item using different phrases. So you want to summarize the phrase, like a common phrase, that people are hitting. For instance, when there is a wedding, some people may just type the groom’s name, and others may type the other party’s name. So how do you recognize that people are actually hitting the same term when they are typing different terms?

This is one difficulty, and it is something that relies on, say, advanced machine learning, AI, and NLP to try to understand what term people are actually trying to search for. That’s the first thing.
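To make the first challenge concrete, here is a toy sketch in which different queries are mapped to one common term before counting. It deliberately swaps the machine learning and NLP mentioned above for a hand-written lookup table, so it only shows where such a step would sit, not how a real platform does it. The ALIASES table, the names in it, and the canonical function are all hypothetical.

```python
from collections import Counter

# Hypothetical alias table standing in for a learned model:
# each known query variant maps to one common phrase.
ALIASES = {
    "alice": "alice-bob wedding",
    "bob": "alice-bob wedding",
    "alice bob wedding": "alice-bob wedding",
}

def canonical(query):
    """Normalize case and whitespace, then map known variants to a common phrase."""
    key = " ".join(query.lower().split())
    return ALIASES.get(key, key)

queries = ["Alice", "bob", "Alice  Bob wedding", "weather"]
counts = Counter(canonical(q) for q in queries)
print(counts.most_common(1))
```

Without the mapping, the three wedding-related queries would be counted as three separate terms with one hit each, and none of them would stand out.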

The second thing is how you can do it efficiently. You want to refresh the frequent items really very frequently. You want to refresh the most frequent items every minute or every second. Then the challenge is how you can do this on a platform that consists of hundreds of thousands of machines. Distributed stream processing is actually the technology behind this; it enables the computation of the most frequent items at such scale. To make this stream processing work, you need lots of machines collaborating with each other. Each machine just handles a small partition of the searches. And then at the end you just aggregate everything together.
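The partition-then-aggregate idea can be sketched as follows. This is a single-process simulation, assuming hash partitioning by term and a plain merge of per-worker counts; a real stream processing platform would run each worker on a separate machine and refresh the result continuously. The function names and the sample data are hypothetical.

```python
from collections import Counter
from zlib import crc32

def partition(searches, num_workers):
    """Assign each search to a worker by hashing its term,
    so every occurrence of a term lands on the same worker."""
    parts = [[] for _ in range(num_workers)]
    for term in searches:
        parts[crc32(term.encode("utf-8")) % num_workers].append(term)
    return parts

def local_count(part):
    """Work done independently by one worker on its own partition."""
    return Counter(part)

def aggregate(partials, k):
    """Merge the per-worker counts and return the k most frequent terms."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total.most_common(k)

# Hypothetical batch of searches arriving within one refresh interval.
searches = ["wedding"] * 4 + ["football"] * 3 + ["weather"]
parts = partition(searches, num_workers=3)
partials = [local_count(p) for p in parts]  # would run in parallel on separate machines
top = aggregate(partials, k=2)
print(top)
```

Hashing by term is one possible design choice: because each term is counted by exactly one worker, the final step only has to merge and rank the partial counts, which is much less data than the raw searches.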

So the technology behind that is essentially distributed systems, like networking and databases.