Lucene TokenStreams

항상 초심으로.. 2013. 6. 18. 00:13

Lucene uses analyzers to turn a field's value into a stream of tokens.
At index time you specify an analyzer, and that analyzer maps a field name and its text to a TokenStream.
In this post we write a small class that demonstrates basic TokenStream usage.

package com.tistory.outofmemoryerror.lucene;

import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

public class TokenStreamTutorial {
	public static void main(String[] args) throws Exception {
		String text = 
				"Every mammal on this planet instinctively develops a natural " + 
				"equilibrium with the surrounding environment; " + 
				"but you humans do not. Instead you multiply, " + 
				"and multiply, until every resource is consumed." + 
				"The only way for you to survive is to spread to another area. " + 
				"There is another organism on this planet " + 
				"that follows the same pattern... a virus.";
		String fieldName = "content";
		
		// Create a standard analyzer, which supplies the tokenizer.
		// The constructor takes the Lucene version (behavior can differ between versions).
		StandardAnalyzer standardAnalyzer = new StandardAnalyzer(Version.LUCENE_36);
		// Wrap the input text in a StringReader. Outside of a Document there is
		// no other way to pass the text in, and a StringReader is convenient because
		// it holds no external resource handles.
		Reader textReader = new StringReader(text);
		
		// Create a TokenStream for the field name and text value.
		TokenStream tokenStream = standardAnalyzer.tokenStream(fieldName, textReader);
		// CharTermAttribute replaces the older TermAttribute. It arrived in 3.1
		// and implements CharSequence.
		
		// Register the three attributes we want to read from the TokenStream:
		// CharTermAttribute returns the term (the basic unit of indexing and search
		// in Lucene; thinking of it as a word is fine here),
		// OffsetAttribute returns where the term sits in the text (start and end offsets),
		// PositionIncrementAttribute returns how far the current token is from
		// the previous token.
		CharTermAttribute terms = tokenStream.addAttribute(CharTermAttribute.class);
		OffsetAttribute offsets = tokenStream.addAttribute(OffsetAttribute.class);
		PositionIncrementAttribute positions = 
				tokenStream.addAttribute(PositionIncrementAttribute.class);
		
		System.out.println("INCR\t(START,\tEND)\tTERM");
		// Reset the stream before consuming it, as the TokenStream workflow expects.
		tokenStream.reset();
		while (tokenStream.incrementToken()) {
			int increment = positions.getPositionIncrement();
			int start = offsets.startOffset();
			int end = offsets.endOffset();
			String term = terms.toString();

			System.out.println(increment + "\t(" + start + ",\t\t" + end + ")\t" + term);
		}
		// Signal end-of-stream, then release the stream and the analyzer.
		tokenStream.end();
		tokenStream.close();
		
		standardAnalyzer.close();
	}
}

The output is shown below.

INCR	(START,	END)	TERM
1	(0,		5)	every
1	(6,		12)	mammal
3	(21,		27)	planet
1	(28,		41)	instinctively
1	(42,		50)	develops
2	(53,		60)	natural
1	(61,		72)	equilibrium
3	(82,		93)	surrounding
1	(94,		105)	environment
2	(111,		114)	you
1	(115,		121)	humans
1	(122,		124)	do
2	(130,		137)	instead
1	(138,		141)	you
1	(142,		150)	multiply
2	(156,		164)	multiply
1	(166,		171)	until
1	(172,		177)	every
1	(178,		186)	resource
2	(190,		202)	consumed.the
1	(203,		207)	only
1	(208,		211)	way
2	(216,		219)	you
2	(223,		230)	survive
3	(237,		243)	spread
2	(247,		254)	another
1	(255,		259)	area
3	(270,		277)	another
1	(278,		286)	organism
3	(295,		301)	planet
2	(307,		314)	follows
2	(319,		323)	same
1	(324,		331)	pattern
2	(337,		342)	virus
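
The START and END columns are character offsets into the original string, so they can be used to recover the text exactly as it was written, before lowercasing. The following is a minimal sketch using only the JDK. The class name OffsetDemo and the helper surface() are made up for illustration, and the offset pairs are copied by hand from the output above rather than read from a live TokenStream.

```java
import java.util.Locale;

final class OffsetDemo {
    // Same sample text as TokenStreamTutorial, including the missing
    // space after "consumed." that the original listing has.
    static final String TEXT =
            "Every mammal on this planet instinctively develops a natural " +
            "equilibrium with the surrounding environment; " +
            "but you humans do not. Instead you multiply, " +
            "and multiply, until every resource is consumed." +
            "The only way for you to survive is to spread to another area. " +
            "There is another organism on this planet " +
            "that follows the same pattern... a virus.";

    /** Hypothetical helper: returns the original text covered by [start, end). */
    static String surface(int start, int end) {
        return TEXT.substring(start, end);
    }

    public static void main(String[] args) {
        // (start, end) pairs copied from the output table.
        int[][] offsets = { {0, 5}, {21, 27}, {190, 202}, {337, 342} };
        for (int[] o : offsets) {
            String original = surface(o[0], o[1]);
            // Show the original spelling next to its lowercased form.
            System.out.println("(" + o[0] + "," + o[1] + ")\t" + original
                    + " -> " + original.toLowerCase(Locale.ROOT));
        }
    }
}
```

This kind of lookup is why offsets matter for features such as highlighting: the indexed term is normalized, but the offsets still point at the untouched source text.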

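Every term has been lowercased, and non-word characters, punctuation, and stop words have been removed. One exception is visible in the output: the sample text has no space after "consumed.", so the tokenizer keeps "consumed.the" as a single token. The INCR column also shows that removing a stop word is handled differently from removing non-word characters: for each stop word dropped, the PositionIncrementAttribute of the next surviving token grows by one. For example, "planet" has an increment of 3 because "on" and "this" were removed before it.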
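
To make the INCR column more concrete, here is a small sketch in plain Java that turns position increments into absolute token positions by keeping a running sum. The class name PositionDemo and the method absolutePositions() are hypothetical, and the increments are copied from the first rows of the output above. Starting the counter at -1, so that a first increment of 1 lands on position 0, is an assumption about how positions are counted.

```java
final class PositionDemo {
    /** Hypothetical helper: converts per-token increments to absolute positions. */
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        // Assumption: counting starts one step before the first token,
        // so an increment of 1 puts the first token at position 0.
        int position = -1;
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            positions[i] = position;
        }
        return positions;
    }

    public static void main(String[] args) {
        // First six rows of the output table.
        String[] terms = {"every", "mammal", "planet", "instinctively", "develops", "natural"};
        int[] increments = {1, 1, 3, 1, 1, 2};

        int[] positions = absolutePositions(increments);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(positions[i] + "\t" + terms[i]);
        }
    }
}
```

Under that assumption the positions match each word's index in the original sentence ("planet" is the fifth word, index 4), and the gaps left by removed stop words stay visible.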
