Over the years, the NetBase Platform team has solved many interesting and challenging problems and built a powerful analytic platform capable of analyzing hundreds of millions of documents per day with near real-time latency. Here is a quick look at a few components of the NetBase analytic platform.
We run many flavors of data connectors and downloaders that connect to a variety of data sources, including Twitter, Facebook, Instagram, Tumblr, YouTube, Sina Weibo, and Tencent Weibo, as well as news, blog, and forum sites. Some data connectors are streaming endpoints, some are RESTful APIs, and some use SFTP to push files through. Every day, we receive and persist over 150 million documents; our live archive provides access to over 100 billion documents.
Normalization and Indexing
The platform dispatches each document that arrives at our facility to a cluster of hundreds of machines organized under a proprietary MapReduce framework built in-house. Junk and spam content is rejected at an early stage, and the filtered documents are normalized, indexed, and merged into a searchable document store backed by inverted indexes. The whole pipeline is built to be always available and to scale horizontally (by adding servers) as our needs grow.
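The filter, normalize, index, and merge stages described above can be illustrated with a minimal, single-process sketch. This is not NetBase's proprietary framework: the spam markers, the tokenizer, and every function name below are assumptions made only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical spam markers; a production filter would be far more elaborate.
SPAM_MARKERS = {"buy now", "click here"}

def is_spam(text):
    """Reject a document if it contains any known spam marker."""
    lowered = text.lower()
    return any(marker in lowered for marker in SPAM_MARKERS)

def normalize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def map_document(doc_id, text):
    """Map step: emit (term, doc_id) pairs for a document that passes the filter."""
    if is_spam(text):
        return []
    return [(term, doc_id) for term in set(normalize(text))]

def reduce_postings(pairs):
    """Reduce step: merge pairs into an inverted index of term -> sorted doc ids."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def build_index(docs):
    """Run the map step over every document, then reduce into one index."""
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_document(doc_id, text))
    return reduce_postings(pairs)

docs = {
    1: "Great phone, love the camera",
    2: "Click here to buy now!!!",
    3: "The camera on this phone is bad",
}
index = build_index(docs)
print(index["camera"])  # document 2 is filtered out as spam before indexing
```

In a real MapReduce deployment the map calls would run in parallel across machines and the reduce step would be partitioned by term; the sketch keeps both in one process for readability.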
Natural Language Processing
One specialty of NetBase’s indexing procedure is text analysis using powerful NLP. Our NLP code is capable of:
- Detecting 40+ languages
- Running geographic tagging at the city and DMA level
- Applying acronym recognition and anaphora resolution
- Recognizing, categorizing, and assigning sentiment to emojis and emoticons
- Identifying gender, demographic profiles, etc.
- Deep-parsing text in 10+ languages into graphs of tokens and applying information extraction to those graphs to find hidden insights
Distributed Query Cluster
The NetBase query cluster follows the same scale-out principle and distributes all searchable document stores across hundreds of servers. Replication and automatic failover are in place to maximize system availability.
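As a rough illustration of how sharding, replication, and failover fit together, here is a small in-memory sketch. The `Replica` and `ShardedStore` classes, the hash-based placement, and the "try the next replica" policy are all assumptions for the example, not a description of NetBase's actual cluster.

```python
import zlib

class Replica:
    """One copy of a shard; `alive` simulates whether the server is reachable."""
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.docs = {}

    def get(self, doc_id):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.docs.get(doc_id)

class ShardedStore:
    """Hypothetical store that spreads documents over shards with replicas."""
    def __init__(self, num_shards, replicas_per_shard):
        self.shards = [
            [Replica(f"shard{s}-replica{r}") for r in range(replicas_per_shard)]
            for s in range(num_shards)
        ]

    def _shard_for(self, doc_id):
        # crc32 is stable across processes, unlike Python's built-in hash().
        return self.shards[zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)]

    def put(self, doc_id, text):
        # Simplification: write to every replica of the shard synchronously.
        for replica in self._shard_for(doc_id):
            replica.docs[doc_id] = text

    def get(self, doc_id):
        # Failover: try replicas in order and skip any that are unreachable.
        for replica in self._shard_for(doc_id):
            try:
                return replica.get(doc_id)
            except ConnectionError:
                continue
        raise RuntimeError("all replicas unavailable")

store = ShardedStore(num_shards=4, replicas_per_shard=2)
store.put("doc-1", "hello")
store._shard_for("doc-1")[0].alive = False  # simulate losing the first replica
print(store.get("doc-1"))                   # still served by the second replica
```

A read fails only when every replica of the owning shard is down, which is why adding replicas raises availability.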
The query cluster, however, is not just about searching text: many user queries need flexible matching and specialized support for different problem domains. To meet those needs, we implemented many types of query features, such as structured search with Boolean operators, wildcard matching, word proximity, time-series counting at different time units, histograms, and distinct-value counting.
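To make these query types concrete, the sketch below implements toy versions of several of them over a tiny in-memory data set. The documents, field names, and function names are invented for illustration, and a real engine would evaluate these against inverted indexes rather than scanning every document.

```python
from collections import Counter
from datetime import datetime
from fnmatch import fnmatchcase

# Hypothetical sample documents.
DOCS = [
    {"id": 1, "author": "ann", "ts": datetime(2020, 1, 1, 9), "text": "love the new phone camera"},
    {"id": 2, "author": "bob", "ts": datetime(2020, 1, 1, 17), "text": "phone battery is bad"},
    {"id": 3, "author": "ann", "ts": datetime(2020, 1, 2, 8), "text": "camera lens on this phone cracked"},
]

def tokens(doc):
    return doc["text"].split()

def match_all(doc, terms):
    """Boolean AND: every term must appear in the document."""
    return set(terms) <= set(tokens(doc))

def match_any(doc, terms):
    """Boolean OR: at least one term must appear in the document."""
    return bool(set(terms) & set(tokens(doc)))

def match_wildcard(doc, pattern):
    """Wildcard match: some token matches a shell-style pattern such as 'ba*'."""
    return any(fnmatchcase(tok, pattern) for tok in tokens(doc))

def match_near(doc, a, b, max_gap):
    """Proximity: terms a and b occur within max_gap token positions of each other."""
    toks = tokens(doc)
    pos_a = [i for i, tok in enumerate(toks) if tok == a]
    pos_b = [i for i, tok in enumerate(toks) if tok == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)

def count_by_time(docs, unit):
    """Time-series count: bucket documents by day or by hour."""
    fmt = {"day": "%Y-%m-%d", "hour": "%Y-%m-%d %H:00"}[unit]
    return Counter(doc["ts"].strftime(fmt) for doc in docs)

def distinct_count(docs, field):
    """Distinct-value count for one field."""
    return len({doc[field] for doc in docs})

both = [d["id"] for d in DOCS if match_all(d, ["phone", "camera"])]
near = [d["id"] for d in DOCS if match_near(d, "phone", "camera", max_gap=1)]
print(both, near, count_by_time(DOCS, "day"), distinct_count(DOCS, "author"))
```

The same bucketing used in `count_by_time` generalizes to histograms over any numeric or categorical field.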
An in-house query infrastructure was developed to provide access to distributed data stores in the same way that SQL does for relational databases. Queries are submitted to the query infrastructure through a common API, then the infrastructure processes the queries within the distributed cluster and returns results to users.
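One common way to process a query across a distributed cluster is scatter-gather: the same query is sent to every shard in parallel and the partial results are merged before being returned. The sketch below assumes that pattern; the shard contents and the `run_query` entry point are hypothetical and stand in for the common API mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards, each holding a slice of the document store.
SHARDS = [
    ["phone camera", "phone battery"],
    ["camera lens", "phone case", "phone screen"],
]

def count_on_shard(shard_docs, term):
    """Executed on each shard: count local documents containing the term."""
    return sum(1 for doc in shard_docs if term in doc.split())

def run_query(term, shards=SHARDS):
    """Scatter the query to all shards in parallel, then gather and sum the counts."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_counts = pool.map(lambda shard: count_on_shard(shard, term), shards)
        return sum(partial_counts)

print(run_query("phone"))
```

Callers see a single function and a single result, much as a SQL client does not need to know how a relational database lays out its tables.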
The NetBase analytic platform offers its users powerful capabilities and can scale rapidly to accommodate any enterprise. Reach out with any questions. We’re happy to dive deeper into any aspect of our platform and help you address specific pain points.
To learn more about NetBase, sign up for a demo today.