Searching the Web UVA - 1597

語言: CN / TW / HK

  The word “search engine” may not be strange to you. Generally speaking, a search engine searches the web pages available in the Internet, extracts and organizes the information and responds to users’ queries with the most relevant pages. World famous search engines, like GOOGLE, have become very important tools for us to use when we visit the web. Such conversations are now common in our daily life:

“What does the word like ∗ ∗ ∗ ∗ ∗∗ mean?”
“Um. . . I am not sure, just google it.”

  In this problem, you are required to construct a small search engine. Sounds impossible, does it? Don’t worry, here is a tutorial teaching you how to organize large collection of texts efficiently and respond to queries quickly step by step. You don’t need to worry about the fetching process of web pages, all the web pages are provided to you in text format as the input data. Besides, a lot of queries are also provided to validate your system.

  Modern search engines use a technique called inversion for dealing with very large sets of documents. The method relies on the construction of a data structure, called an inverted index, which associates terms (words) to their occurrences in the collection of documents. The set of terms of interest is called the vocabulary, denoted as V . In its simplest form, an inverted index is a dictionary where each search key is a term ω ∈ V . The associated value b(ω) is a pointer to an additional intermediate data structure, called a bucket. The bucket associated with a certain term ω is essentially a list of pointers marking all the occurrences of ω in the text collection. Each entry in each bucket simply consists of the document identifier (DID), the ordinal number of the document within the collection and the ordinal line number of the term’s occurrence within the document.

  Let’s take Figure-1 for an example, which describes the general structure. Assuming that we only have three documents to handle, shown at the right part in Figure-1; first we need to tokenize the text for words (blank, punctuations and other non-alphabetic characters are used to separate words) and construct our vocabulary from terms occurring in the documents. For simplicity, we don’t need to consider any phrases, only a single word as a term. Furthermore, the terms are case-insensitive (e.g. we consider “book” and “Book” to be the same term) and we don’t consider any morphological variants (e.g. we consider “books” and “book”, “protected” and “protect” to be different terms) and hyphenated words (e.g. “middle-class” is not a single term, but separated into 2 terms “middle” and “class” by the hyphen). The vocabulary is shown at the left part in Figure-1. Each term of the vocabulary has a pointer to its bucket. The collection of the buckets is shown at the middle part in Figure-1. Each item in a bucket records the DID of the term’s occurrence.

  After constructing the whole inverted index structure, we may apply it to the queries. The query is in any of the following formats:

term
term AND term
term OR term
NOT term

  A single term can be combined by Boolean operators: ‘AND’, ‘OR’ and ‘NOT’ (‘term1 AND term2’ means to query the documents including term1 and term2; ‘term1 OR term2’ means to query the documents including term1 or term2; ‘NOT term1’ means to query the documents not including term1). Terms are single words as defined above. You are guaranteed that no non-alphabetic characters appear in a term, and all the terms are in lowercase. Furthermore, some meaningless stop words (common words such as articles, prepositions, and adverbs, specified to be “the, a, to, and, or, not” in our problem) will not appear in the query, either.

  For each query, the engine based on the constructed inverted index searches the term in the vocabulary, compares the terms’ bucket information, and then gives the result to user. Now can you construct the engine?

在這裡插入圖片描述

Input

  The input starts with integer N (0 < N < 100) representing N documents provided. Then the next N sections are N documents. Each section contains the document content and ends with a single line of ten asterisks.

**********

   You may assume that each line contains no more than 80 characters and the total number of lines in the N documents will not exceed 1500.

   Next, integer M (0 < M ≤ 50000) is given representing the number of queries, followed by M lines, each query in one line. All the queries correspond to the format described above.

Output

  For each query, you need to find the document satisfying the query, and output just the lines within the documents that include the search term (For a ‘NOT’ query, you need to output the whole document). You should print the lines in the same order as they appear in the input. Separate different documents with a single line of 10 dashes.

----------

  If no documents matching the query are found, just output a single line: ‘Sorry, I found nothing.’. The output of each query ends with a single line of 10 equal signs.

==========

Sample Input

4
A manufacturer, importer, or seller of
digital media devices may not (1) sell,
or offer for sale, in interstate commerce,
or (2) cause to be transported in, or in a
manner affecting, interstate commerce,
a digital media device unless the device
includes and utilizes standard security
technologies that adhere to the security
system standards.
**********
Of course, Lisa did not necessarily
intend to read his books. She might
want the computer only to write her
midterm. But Dan knew she came from
a middle-class family and could hardly
afford the tuition, let alone her reading
fees. Books might be the only way she
could graduate
**********
Research in analysis (i.e., the evaluation
of the strengths and weaknesses of
computer system) is essential to the
development of effective security, both
for works protected by copyright law
and for information in general. Such
research can progress only through the
open publication and exchange of
complete scientific results
**********
I am very very very happy!
What about you?
**********
6
computer
books AND computer
books OR protected
NOT security
very
slick

Sample Output

want the computer only to write her
---------
computer system) is essential to the
==========
intend to read his books. She might
want the computer only to write her
fees. Books might be the only way she
==========
intend to read his books. She might
fees. Books might be the only way she
---------
for works protected by copyright law
==========
Of course, Lisa did not necessarily
intend to read his books. She might
want the computer only to write her
midterm. But Dan knew she came from
a middle-class family and could hardly
afford the tuition, let alone her reading
fees. Books might be the only way she
could graduate
---------
I am very very very happy!
What about you?
==========
I am very very very happy!
==========
Sorry, I found nothing.
==========

HINT

題目大意(摘自紫皮書):

輸入n 篇文章和m 個請求(n <100,m ≤50000),每個請求都是以下4種格式之一。

  • A:查詢包含關鍵字A的文章。

  • A AND B:查詢同時包含關鍵字A和B的文章。

  • A OR B:查詢包含關鍵字A或B的文章。

  • NOT A:查詢不包含關鍵字A的文章。

    處理詢問時,需要對於每篇文章輸出證據。前3種詢問輸出所有至少包含一個關鍵字的行,第4種詢問輸出整篇文章。關鍵字只由小寫字母組成,查詢時忽略大小寫。每行不超過80個字元,一共不超過1500行。

由於紫皮書上說的很簡練,直接摘過來使用了。

注意:

  • 查詢的時候不區分大小寫
  • 關鍵詞裡面不包含非字母字元
  • 輸出時候每一個指令,輸出一篇文章中所有結果後輸出一行“----------”共10個,pdf上是9個!對於每一個指令輸出結果的最後一行用10個等號代替減號。

解題思路:

使用map對映每一個關鍵字的座標。使用set來儲存每一個關鍵字對應的出現的地址座標集合。

指令裡面的部分內容可以使用set集合自帶的函式實現:set_union()、set_intersection()。

vector<vector<string>>text 儲存文章內容

map<string, set<vector<int>>>point 鍵是關鍵字,鍵值是出現的地址集合,集合的每一個元素是一個由連個元素組成的座標陣列

map<string, set<int>>textid 用來對映每一個關鍵字出現過的文章編號

前三個指令按行輸出,按文章判斷,因此從得到的關鍵字的文章編號集合裡面對應的每一個文章找到關鍵字的在文章中的行號進行輸出。按整篇文章輸出較簡單,不解釋。

本以為使用集合再帶的求並集、合集的函式會更簡單一些,沒想到寫寫出來程式碼量一點不少,細節處理一點不少😂

Accepted

#include<bits/stdc++.h>
using namespace std;

#define ALL(x) x.begin(),x.end()
#define INS(x) inserter(x,x.begin())
vector<vector<string>>text;							//儲存文章
map<string, set<vector<int>>>point;					//儲存每一個關鍵字座標
map<string, set<int>>textid;						//儲存每一個關鍵字的文章號


void print(set<int>p,string s1,string s2="") {
	if (p.empty())cout << "Sorry, I found nothing." << endl<<"=========="<<endl;//如果儲存文章編號的集合是空的
	else {
		for (auto i = p.begin();i != p.end();) {
			int flag = 0;
			set<vector<int>>p1, p2, p3;
			if (point.count(s1))p1 = point[s1];									//將兩個關鍵得到的集合進行合併成一個,目的是為了消除一行中出現連個關鍵詞
			if (point.count(s2))p2 = point[s2];									//並且也可以將兩個集合真和到一個集合中進行排序
			set_union(ALL(p1), ALL(p2), INS(p3));
			for (auto j = p3.begin();j != p3.end();j++)							//輸出是一個文章一個文章進行輸出,但關鍵詞對應的集合裡面可能還有其他文章的
				if ((*j).front() == (*i)){										//因此需要進行判斷
					cout << text[(*i)][(*j).back()] << endl;
					flag++;														//如果沒有輸出就輸出Sorry
				}
			if (!flag)cout << "Sorry, I found nothing." << endl;
			cout << ((++i) == p.end() ? "==========" : "----------") << endl;
		}
	}
}

int main() {
	
	int sum,num,  c = 0;			//文章總數、關鍵字在文章中的行號
	string s,s1,s2;
	cin >> sum;getchar();		//吃掉回車
	for (int i = 0;i < sum;i++) {//迴圈錄入
		vector<string> temp;c = 0;//初始化,temp用來儲存文章
		while (getline(cin, s) && s != "**********") {
			temp.push_back(s);		//錄入文章的每一行
			for (int j = 0;j < s.size();j++)
				if (!isalpha(s[j]))s[j] = ' ';//將每一行進行轉化,吃掉非字母並將大寫字母轉化未小寫
				else if(isupper(s[j]))s[j] = s[j] - 'A' + 'a';
			stringstream ss(s);
			vector<int>id;
			id.push_back(i);id.push_back(c);	//記錄座標
			while (ss >> s) { 
				textid[s].insert(i);//將每一個關鍵字的文章號記錄下來
				point[s].insert(id);			//將座標錄入map
			}
			c++;								//行號++
		}
		text.push_back(temp);					//將文章入棧
	}
	cin >> num;getchar();						//指令數量
	set<int>p1, p2, p;
	for (int i = 0;i < num;i++) {
		getline(cin, s);
		p1.clear();p2.clear();p.clear();s1.clear();s2.clear();
		if (s.find("AND") ==-1&& s.find("OR") ==-1&& s.find("NOT") ==-1) {	//如果僅僅包含關鍵字
			for (int i = 0;i < s.size();i++) if (isupper(s[i]))s[i] = s[i] - 'A' + 'a';//轉化未小寫
			if (textid.count(s))p = textid[s];								//將關鍵字的文章號取出來
			s1 = s;
		}
		else{																//如果包含指令
			stringstream ss(s);	
			if (s.find("NOT") != -1) {										//如果是NOT
				ss >> s >> s1;
				for (int j = 0;j < s1.size();j++)if (isupper(s1[j]))s1[j] = s1[j] - 'A' + 'a';
				if (textid.count(s1))p = textid[s1];						//獲取文章編號
				if (p.size() == sum) { cout << "Sorry, I found nothing." << endl << "==========" << endl;continue; }
				int flag = 0;												
				for (int j = 0;j < text.size();j++) {					
					if (!p.count(j)) {										//如果集合裡面沒有此文章的下標編號就輸出
						cout << (0 != flag ? "----------\n" : "");
						for (int h = 0;h < text[j].size();h++)
							cout << text[j][h] << endl;
						flag++;
							
					}
				}
				cout << "==========" << endl;
				continue;
			}
			else {
				ss >> s1 >> s >> s2;
				for (int j = 0;j < s1.size();j++)if (isupper(s1[j]))s1[j] = s1[j] - 'A' + 'a';//轉化大寫
				for (int j = 0;j < s2.size();j++)if (isupper(s2[j]))s2[j] = s2[j] - 'A' + 'a';
				if (textid.count(s1))p1 = textid[s1];											//獲取集合
				if (textid.count(s2))p2 = textid[s2];
				if(s=="AND")set_intersection(ALL(p1), ALL(p2), INS(p));							//求並集
				if(s=="OR")	set_union(ALL(p1), ALL(p2), INS(p));								//求和
			}
		}
		print(p,s1,s2);																			//列印結果
	}
}