Depfinder

Dependency aware sentence search engine for Japanese sentences

Do you find it easy to search for example sentences on the web? I sometimes do not. Search engines like Google look for documents: large pieces of text that contain the keywords in a query.

But... queries on words alone do not always work well! DepFinder is a search engine that gives you more power, but at the same time it is a bit more difficult to master than a regular search engine.

What is DepFinder

Well... It is a search engine. It is available at http://lotus.kuee.kyoto-u.ac.jp/depfinder/search. I try to keep it online as much as possible, but there could be some downtime. The total number of available sentences could vary from time to time as well.

And no sources yet, sorry.

Usually, search engines let you search only by words. DepFinder is different. It lets you search by words, by dependencies between words, by grammatical forms, and by parts of speech.

The first one is the usual part; the other three probably are not. Let's see exactly how it differs from general search engines.

What can DepFinder do

Here is a collection of interesting queries.

Usage of onomatopoeia

~物が→ピリピリ→~動く*,-する*,@ピリピリ

It is difficult to find good examples for adverbs and especially onomatopoeia. But it is not a problem for DepFinder.

Checking if a word can be used in a certain situation

〜2つの→面から

I used this query to find out whether it is possible to say "to think about a problem from X sides" in Japanese using the word 面.

Usage of grammar

What can you do from morning till noon?

朝から→昼まで→〜飲む*

DepFinder by Example

This section introduces the query language, from simple constructs to its full power.

Dependency Query

The most famous Japanese plant is sakura, or 桜. The motion of its petals as they fall from a tree is usually described poetically with the verb 舞い落ちる. Let's see in what situations this "phrase" is used: 桜→舞い落ちる. The search results should be a list of matching sentences.

Each of the sentences contains both 桜 and 舞い落ちる. But there is more. In each sentence there is a dependency relation between these two words. For a detailed explanation, please refer to the Wikipedia article on dependency grammar. In simple terms, however, a dependency relation is formed between the words that have the strongest connection in a sentence.

In this case, 桜 is called a child and 舞い落ちる is called a parent. There can be other words between a child and a parent, as in the sentence "そして桜は静かに舞い落ちる". It is possible to swap the two words in the query, creating another one: 舞い落ちる→桜. In the search results of this query, 舞い落ちる becomes a child of 桜.

Several dependencies

It is possible to specify several dependencies at the same time: 綺麗な→咲く→桜. They are processed as if they were on the same level, as sibling children of the last element.

Use the arrow symbol (→ or ->) to specify a dependency between two or more words.
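
As a rough illustration of how such a chain could be read, here is a minimal Python sketch. It is not DepFinder's code (the sources are not published); the function name `parse_dependency_chain` and the returned dictionary are assumptions made for this example. It splits a query on either arrow and treats the last element as the parent and every earlier element as one of its sibling children.

```python
import re

# Either arrow spelling separates the elements of a dependency query.
ARROW = re.compile(r"→|->")


def parse_dependency_chain(query: str) -> dict:
    """Split a dependency query into one parent and its children.

    Hypothetical helper: every element before the last one is treated
    as a sibling child of the last element.
    """
    # Whitespace carries no meaning here, so remove it before splitting.
    compact = "".join(query.split())
    parts = [p for p in ARROW.split(compact) if p]
    if not parts:
        raise ValueError("empty query")
    return {"parent": parts[-1], "children": parts[:-1]}


print(parse_dependency_chain("桜→舞い落ちる"))
print(parse_dependency_chain("綺麗な -> 咲く -> 桜"))
```

For 綺麗な→咲く→桜 the sketch reports 桜 as the parent and both 綺麗な and 咲く as its children, which is the "same level" reading described above.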

Multiple Inclusions

DepFinder takes its raw data from the Internet, so search results can contain sentences that are not "clean". Also, by default, DepFinder tries to match a query as many times as possible. For example, let's search just for sakura: 桜. The results will contain sentences that have multiple sakuras in them, which is not very useful. Let's prepend an @ symbol to the query: @桜. DepFinder prefers sentences in which such a query matches only once. Effectively, this gives you a way to control whether a word should occur once or possibly more than once in a sentence.

One remark: most useful queries currently require prepending the @ symbol to almost all query parts. Future revisions of DepFinder will probably reverse the current behavior of the @ symbol: avoid duplicates by default and allow them when told to.

Use the @ symbol to prefer single inclusions of a query.
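
The preference for single inclusions can be pictured as a ranking step. The sketch below is only an illustration under simplifying assumptions: it counts plain substring occurrences instead of real word matches, and `rank_single_first` is a made-up name, not part of DepFinder.

```python
def rank_single_first(sentences: list[str], word: str) -> list[str]:
    """Order sentences so those containing `word` exactly once come first.

    Toy model of the @ modifier: occurrences are counted as substrings,
    whereas a real engine would count matched words.
    """
    matching = [s for s in sentences if word in s]
    # False sorts before True, so single-occurrence sentences lead.
    # sorted() is stable, so the input order is kept within each group.
    return sorted(matching, key=lambda s: s.count(word) != 1)


sentences = [
    "桜、桜、今咲き誇る",
    "桜が咲く",
    "梅が咲く",
    "桜の下で桜餅を食べる",
]
for sentence in rank_single_first(sentences, "桜"):
    print(sentence)
```

Sentences with repeated matches are still returned, only later, which mirrors "prefers" rather than "requires".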

Grammatical Form Query

Let's return to our sakura. Before its petals fall, it surely has to bloom: @桜→咲く. Note that every sentence contains only the basic form of 咲く. Let's try another form: @桜→咲いてほしい. This time it is only 咲いてほしい! DepFinder matches the exact form given in a query by default.

To find any grammatical form of a word, add * after the word: @桜→咲いた*. This query contains the past form of 咲く, but because of the *, DepFinder matches any form of 咲く.

The star can also be used with forms that contain more than one grammatical part: @桜→咲いています*. In such cases it modifies only the last grammatical part. In this example that part is ます, and its possible forms include ました and ません.

The exact grammatical form is matched by default. Append * to match any grammatical form.

Part of Speech Query

Are you already bored of sakura? I am. Let's find something else that can bloom. We will do that by giving DepFinder the following query: @~桜が→咲く*. The new symbol, the tilde (~), finds sentences that contain words with the same part of speech as the word prefixed by the tilde. In this query 桜 is a noun, so in English the query reads "find a noun with が that has 咲く in any form as a parent".

Of course, this query works with other parts of speech as well: @〜綺麗な→家, @〜強く→吹く*, @〜ゆっくりの→俺. Because DepFinder keeps the grammatical form of queries, part of speech queries can be useful for searching for a word with some grammar.

Use ~A to find sentences that contain words of the same part of speech as A.

Compound Query

The queries described above are primitive. They can be combined to search for more complex things. By separating two queries with a comma you get a single compound query: @聞く、@動物. General search systems like Google use spaces as word separators, but DepFinder uses a comma for this purpose. Additionally, all spaces in a query are ignored.

A compound query searches for at least one of its parts, but it prefers to match as many parts as possible.
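
One way to picture this behaviour is to split the query on commas and then rank sentences by how many parts they contain. The following sketch assumes that both the ASCII comma and the Japanese comma 、 act as separators (both appear in the examples on this page). It uses substring tests as a stand-in for real matching and does not handle modifiers such as @. The function names are hypothetical.

```python
import re


def split_compound(query: str) -> list[str]:
    """Split a compound query into its parts, ignoring all whitespace."""
    compact = "".join(query.split())
    return [part for part in re.split(r"[,、]", compact) if part]


def rank_by_matched_parts(sentences: list[str], query: str) -> list[tuple[int, str]]:
    """Keep sentences matching at least one part; most matched parts first."""
    parts = split_compound(query)
    scored = [(sum(p in s for p in parts), s) for s in sentences]
    hits = [(n, s) for n, s in scored if n > 0]
    # Stable sort: sentences with equal scores keep their input order.
    return sorted(hits, key=lambda pair: -pair[0])


sentences = ["音楽を聞く", "動物の声を聞く", "動物が好きだ", "本を読む"]
for score, sentence in rank_by_matched_parts(sentences, "聞く、 動物"):
    print(score, sentence)
```

The sentence containing both 聞く and 動物 is listed first, the ones containing only one of them follow, and 本を読む is dropped because it matches no part.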

Query Part Modifiers

There are three query part modifiers: @, + and -. The first one was explained earlier. The other two have their usual meaning in search systems.

Plus

The plus modifier (+) makes the search engine always match the query part marked by the plus. Let's compare the number of hits for two queries: 聞く、動物 and +聞く、+動物. The first one essentially searches for either 聞く or 動物 in a sentence, while the second one searches only for sentences that contain both at the same time. This explains the difference in the number of hits.

Minus

The minus modifier (-) makes the search engine find sentences that do not match the query part marked by the minus. For example, let's find an action of sakura other than blooming: @桜が→~咲く,-咲く.
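
The combined effect of + and - can be sketched as a filter over query parts that have already been split. How DepFinder combines the modifiers internally is not documented here, so the rule below is an assumption: every + part must be present, every - part must be absent, and if there are no + parts then at least one unmarked part must be present. Substring containment stands in for real matching, and `passes_modifiers` is an illustrative name.

```python
def passes_modifiers(sentence: str, parts: list[str]) -> bool:
    """Check one sentence against query parts that may carry + or -.

    Assumed rule: all + parts present, all - parts absent; when there
    are no + parts, at least one unmarked part must be present.
    """
    required = [p[1:] for p in parts if p.startswith("+")]
    excluded = [p[1:] for p in parts if p.startswith("-")]
    optional = [p for p in parts if not p.startswith(("+", "-"))]

    if any(word not in sentence for word in required):
        return False
    if any(word in sentence for word in excluded):
        return False
    # Without + parts, a sentence still has to match something.
    if not required and optional:
        return any(word in sentence for word in optional)
    return True


print(passes_modifiers("音楽を聞く", ["聞く", "動物"]))
print(passes_modifiers("音楽を聞く", ["+聞く", "+動物"]))
print(passes_modifiers("桜が咲く", ["桜が", "-咲く"]))
```

The first call accepts the sentence because one unmarked part is enough, the second rejects it because +動物 is missing, and the third rejects it because the excluded 咲く is present.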

Contact Information:

E-mail: arseny <:an email sign:> nlp.ist.i.kyoto-u.ac.jp

Twitter (Mostly in Russian): @eiennohito

Twitter (Mostly in Japanese): @to_aruchan

Details

TODO: Write more clearly.

Priority/Precedence

The query operations have the following precedence or order of resolution:

  1. grammatical query
  2. part of speech query
  3. dependency query
  4. query part modifiers
  5. compound query separators
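
Read as binding strength, this list means that * and ~ attach to a single word, arrows join words into a chain, a modifier applies to a whole chain, and commas separate the resulting parts. A parser can therefore work from the loosest operator inwards. The sketch below is one possible reading, not DepFinder's implementation: the class and function names are invented, it accepts at most one modifier per part, and it assumes the wave dash 〜 is equivalent to the ASCII tilde because both appear in the examples on this page.

```python
import re
from dataclasses import dataclass, field

ARROW = re.compile(r"→|->")
TILDES = ("~", "〜")  # assumption: both spellings mean the same thing


@dataclass
class Element:
    word: str
    any_form: bool = False   # trailing * (grammatical form query)
    same_pos: bool = False   # leading ~ (part of speech query)


@dataclass
class Part:
    modifier: str            # "@", "+", "-", or "" when absent
    chain: list[Element] = field(default_factory=list)


def parse_element(text: str) -> Element:
    """Resolve the tightest operators: * first, then ~."""
    any_form = text.endswith("*")
    if any_form:
        text = text[:-1]
    same_pos = text.startswith(TILDES)
    if same_pos:
        text = text[1:]
    return Element(text, any_form, same_pos)


def parse_query(query: str) -> list[Part]:
    """Split on commas, strip one modifier, then split on arrows."""
    compact = "".join(query.split())  # spaces in a query are ignored
    parts = []
    for raw in re.split(r"[,、]", compact):
        if not raw:
            continue
        modifier = raw[0] if raw[0] in "@+-" else ""
        body = raw[len(modifier):]
        chain = [parse_element(e) for e in ARROW.split(body) if e]
        parts.append(Part(modifier, chain))
    return parts


for part in parse_query("~物が→ピリピリ→~動く*,-する*,@ピリピリ"):
    print(part)
```

Applied to the onomatopoeia query from the beginning of this page, the sketch yields three parts: an unmodified chain of three elements, an excluded する in any form, and a single-inclusion ピリピリ.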

Query part scope

Every query part should be a bunsetsu. A bunsetsu usually consists of one content word together with all the grammatical words attached to it.

Examples of bunsetsu separation of sentences:

Queries like 桜が咲く will not work because they contain two bunsetsu. You need to split them into either a compound query or a dependency query.