자바를 자바 22 (Reading from HTML)

간단소개

팔만코딩경 컨트리뷰터

ContributorNotionAccount

주제 / 분류

Java

Scrap

태그

9 more properties

Reading contents from an HTML file

우리는 데이터를 URL로 부터 읽어들일 수 있다. 이 과정은 파일로 부터 데이터를 읽어오는 것과 매우 유사하다.

URL url = null;
BufferedReader input = null;
//특정 위치의 html파일을 읽어오기
String address = "<https://icslsogang.github.io/courses/cse3040/hello.html>";
String line = "";

try {
	url = new URL(address);
	input = new BufferedReader(new InputStreamReader(url.openStream()));

	while((line=input.readLine()) != null) {
		System.out.println(line);
	}
	input.close();
} catch(Exception e) {
	e.printStackTrace();
}
Plain Text
복사

위와 같이 URL에서 가져온 파일을 읽어들일 수 있다는 것을 확인할 수 있다.

Downloading a file from a URL

URL url = null;
InputStream in = null;
FileOutputStream out = null;

// 읽고자 하는 것이 jpg 파일인 경우
String address = "<https://icslsogang.github.io/courses/cse3040/sogang_campus.jpg>";

int ch = 0;
try {
	url = new URL(address);

    // 이 방식으로 inputStream처럼 만들 수 있다.
	in = url.openStream();

    // 아래의 파일에 이미지를 씌우겠다는 의미
	out = new FileOutputStream("sogang_campus.jpg");
	while((ch=in.read()) != -1) {
		out.write(ch);
	}
	in.close();
	out.close();
} catch(Exception e) {
	e.printStackTrace();
}
System.out.println("File download complete.");
Plain Text
복사

Parsing an HTML File

public class Lecture {
	static ArrayList<String> lines = new ArrayList<String>();
	public static void main(String[] args) {
		URL url = null;
		BufferedReader input = null;

        // 교보문고의 베스트셀러들을 보여주는 페이지가 파일로 되어 있다.
		String address = "<http://www.kyobobook.co.kr/bestSellerNew/bestseller.laf>";
		String line = "";

        // 베스트 셀러의 제목을 앍아내기
		try {
			url = new URL(address);
			input = new BufferedReader(new InputStreamReader(url.openStream()));

            // 일단 읽어들여서 Line by Line으로 저장한다.
			while((line=input.readLine()) != null) {
				if(line.trim().length() > 0) lines.add(line);
			}
			input.close();
		} catch(Exception e) {
			e.printStackTrace();
		}

		int rank = 1;
		int status = 0;

        // 한줄씩 가지고 오면서 데이터를 분석한다.
		for(int i=0; i<lines.size(); i++) {
			String l = lines.get(i);
			if(status == 0) {
				if(l.contains("div class=\\"detail\\"")) status = 1;
			} else if(status == 1) {
				if(l.contains("div class=\\"title\\"")) status = 2;
			} else if(status == 2) {
				if(l.contains("a href")) {
					int begin = l.indexOf("<strong>") + "<strong>".length();
					int end = l.indexOf("</strong>");
					System.out.println(rank + "위: " + l.substring(begin, end));
					status = 0;
					rank++;
				}
			}
		}
	}
}
Plain Text
복사

Parsing an HTML file using jsoup

jsoup : Java 외부 라이브러리로 HTML 로 부터 데이터를 뽑아내거나 복제하는데 편리하다.

https://jsoup.org/download jsoup을 build Path의 Add External Archives를 통해 추가시키면 사용이 가능하다. 하지만 여기에서 중요한 것이 만약 이전에 Module-info를 한 번이라도 만들었다면 문제가 생길 수 있다. 그렇게 때문에 이것을 미리 프로젝트에서 삭제해주어야 한다.

jsoup을 사용하면 아래와 같이 쫌 더 짧아지고 원하는 결과물을 구해낼 수 있다.

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Lecture {
	public static void main(String[] args) throws Exception {
		String url = "<http://www.kyobobook.co.kr/bestSellerNew/bestseller.laf>";
		Document doc = null;

        // url에 접속한다.
		try {
			doc = Jsoup.connect(url).get();
		} catch(IOException e) {
			System.out.println(e.getMessage());
		}

        // select를 사용해서 doc에서 이 태그가 나오는 부분을 찾아준다. 그래서 이것을 bestsellers에 넣어준다.
        //그래서 div detail 부분의 모든 element를 가지고 온다.
		Elements bestsellers = doc.select("div.detail");

        // bestseller로 가져온 것중에 div title 부분만 가져온다.
		Elements titles = bestsellers.select("div.title");

        // title 내부의 href 태그만 가져온다.
		Elements booktitles = titles.select("a[href]");

        // for을 이용해서 meta-data가 아닌 부분만 출력하게 한다.
		for(int i=0; i<booktitles.size(); i++) {
			System.out.println(i+1 + "위: " + booktitles.eq(i).text());
		}
	}
}
Plain Text
복사

try {
	doc = Jsoup.connect(url).get();
} catch(IOException e) {
	System.out.println(e.getMessage());
}
Plain Text
복사

MalformedURLException, HttpStatusException, UnsupportedMimeTypeException, SocketTimeoutException, IOException와 같은 에러가 생길 수 있는데, IOException이 최상위 개념이다.

위의 메뉴얼은 jsoup에서 가져올 수 있는 데이터들의 종류들이다.