Source: http://www.rgagnon.com/javadetails/java-0487.html

(1) The nio package included in Java 7 makes it easy to check a file's type.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Test {
  public static void main(String[] args) throws IOException {
    Path source = Paths.get("c:/temp/0multipage.tif");
    System.out.println(Files.probeContentType(source));
    // output : image/tiff
  }
}
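Files.probeContentType returns null when no installed detector recognizes the file, and its result depends on the platform. The following is a minimal sketch of guarding against that. The class name ContentTypeProbe, the helper probeOrDefault and the fallback value are illustrative assumptions, not part of the original tip.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ContentTypeProbe {
  // Hypothetical helper: returns the fallback when the type cannot be determined.
  public static String probeOrDefault(Path path, String fallback) {
    try {
      String type = Files.probeContentType(path); // may be null
      return type != null ? type : fallback;
    } catch (IOException e) {
      // Treat an I/O failure the same way as an unknown type.
      return fallback;
    }
  }

  public static void main(String[] args) {
    Path source = Paths.get("c:/temp/0multipage.tif");
    System.out.println(probeOrDefault(source, "application/octet-stream"));
  }
}
```

Swallowing the IOException keeps the call site simple. Code that has to tell "unknown type" apart from "unreadable file" would rethrow it instead.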

[See also] Transparently improve Java 7 mime-type recognition with Apache Tika.

(2) javax.activation.MimetypesFileTypeMap can be used to check the file type.

Download the activation.jar library from http://java.sun.com/products/javabeans/glasgow/jaf.html.

The MimetypesFileTypeMap class maps a File to a MIME type. It uses the MIME types defined in a resource file inside activation.jar.

import javax.activation.MimetypesFileTypeMap;
import java.io.File;

class GetMimeType {
  public static void main(String args[]) {
    File f = new File("gumby.gif");
    System.out.println("Mime Type of " + f.getName() + " is " +
                         new MimetypesFileTypeMap().getContentType(f));
    // expected output :
    // "Mime Type of gumby.gif is image/gif"
  }
}

The built-in list of MIME types is very small, but it is easy to extend it with additional types.

MimetypesFileTypeMap searches for MIME type entries in the following order:

Programmatically added entries to the MimetypesFileTypeMap instance.
The file .mime.types in the user's home directory.
The file <java.home>/lib/mime.types.
The file or resources named META-INF/mime.types.
The file or resource named META-INF/mimetypes.default (usually found only in the activation.jar file).

This method is useful when you have to deal with incoming files whose names have been normalized. Because only the extension is used to guess the type of a given file, the lookup is very fast.
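To illustrate why an extension-only lookup is cheap, here is a small stand-in written with the standard library only. It is not the javax.activation implementation. The class ExtensionTypeMap is hypothetical, and it assumes entries in the "type ext1 ext2" form used by mime.types files, with application/octet-stream as the default for unknown extensions.

```java
import java.util.HashMap;
import java.util.Map;

public class ExtensionTypeMap {
  private static final String DEFAULT_TYPE = "application/octet-stream";
  private final Map<String, String> byExtension = new HashMap<>();

  // Adds entries written as "type ext1 ext2 ..."; '#' starts a comment line.
  public void addMimeTypes(String entries) {
    for (String line : entries.split("\n")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty() || trimmed.startsWith("#")) {
        continue;
      }
      String[] parts = trimmed.split("\\s+");
      for (int i = 1; i < parts.length; i++) {
        byExtension.put(parts[i], parts[0]);
      }
    }
  }

  // Looks only at the text after the last dot; the file is never opened.
  public String getContentType(String fileName) {
    int dot = fileName.lastIndexOf('.');
    if (dot < 0) {
      return DEFAULT_TYPE;
    }
    String type = byExtension.get(fileName.substring(dot + 1));
    return type != null ? type : DEFAULT_TYPE;
  }

  public static void main(String[] args) {
    ExtensionTypeMap map = new ExtensionTypeMap();
    map.addMimeTypes("image/tiff tif tiff\ntext/plain txt");
    System.out.println(map.getContentType("0multipage.tif"));
  }
}
```

The lookup is a single hash map access, which is why this style of detection is fast but also easy to fool by renaming a file.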

(3) Use java.net.URL.

Be careful: this method is very slow.

This method determines the file type from the extension.

The type is resolved by matching the extension against the MIME types defined in [jre_home]\lib\content-types.properties.

import java.net.*;

public class FileUtils{
  public static String getMimeType(String fileUrl)
    throws java.io.IOException, MalformedURLException
  {
    String type = null;
    URL u = new URL(fileUrl);
    URLConnection uc = null;
    uc = u.openConnection();
    type = uc.getContentType();
    return type;
  }

  public static void main(String args[]) throws Exception {
    System.out.println(FileUtils.getMimeType("file://c:/temp/test.TXT"));
    // output :  text/plain
  }
}

A note from R. Lovelock:

I was trying to find the best way of getting the MIME type of a file and found your site very useful. However, I think URLConnection can be used to get the MIME type without being as slow as you describe:

import java.net.FileNameMap;
import java.net.URLConnection;

public class FileUtils {

  public static String getMimeType(String fileUrl)
      throws java.io.IOException
    {
      FileNameMap fileNameMap = URLConnection.getFileNameMap();
      String type = fileNameMap.getContentTypeFor(fileUrl);

      return type;
    }

    public static void main(String args[]) throws Exception {
      System.out.println(FileUtils.getMimeType("file://c:/temp/test.TXT"));
      // output :  text/plain
    }
  }
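URLConnection also offers two static helpers that avoid opening a connection: guessContentTypeFromName looks at the file name, and guessContentTypeFromStream inspects the leading bytes of a stream that supports mark/reset. The sketch below is an addition to the original tip; the class name UrlConnectionGuess and its two methods are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLConnection;

public class UrlConnectionGuess {
  // Guess from the file name only (extension lookup, the file is not read).
  public static String fromName(String fileName) {
    return URLConnection.guessContentTypeFromName(fileName);
  }

  // Guess from the leading bytes; returns null when nothing is recognized.
  public static String fromBytes(byte[] head) {
    try {
      // ByteArrayInputStream supports mark/reset, which the method requires.
      return URLConnection.guessContentTypeFromStream(new ByteArrayInputStream(head));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // "GIF89a" followed by padding, so that 16 bytes are available to inspect.
    byte[] gifHead = {'G', 'I', 'F', '8', '9', 'a', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    System.out.println(fromName("gumby.gif"));
    System.out.println(fromBytes(gifHead));
  }
}
```

The stream-based guess recognizes only a small set of formats, so a null result should be expected for many files.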

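(4) Use Apache Tika to determine the file type.

Tika is a subproject of Lucene, the search engine. It is a toolkit that uses existing parser libraries to detect and extract metadata and structured text content from various kinds of documents.

The package is updated frequently with new file types, and the Office 2007 formats (docx/pptx/xlsx/etc.) are supported.

Apache Tika

Tika has more than 20 jar dependencies, but because it can detect a wide range of file types it is still very useful.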

import java.io.File;
import java.io.FileInputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class Main {

    public static void main(String args[]) throws Exception {

    FileInputStream is = null;
    try {
      File f = new File("C:/Temp/mime/test.docx");
      is = new FileInputStream(f);

      ContentHandler contenthandler = new BodyContentHandler();
      Metadata metadata = new Metadata();
      metadata.set(Metadata.RESOURCE_NAME_KEY, f.getName());
      Parser parser = new AutoDetectParser();
      // OOXMLParser parser = new OOXMLParser();
      parser.parse(is, contenthandler, metadata);
      System.out.println("Mime: " + metadata.get(Metadata.CONTENT_TYPE));
      System.out.println("Title: " + metadata.get(Metadata.TITLE));
      System.out.println("Author: " + metadata.get(Metadata.AUTHOR));
      System.out.println("content: " + contenthandler.toString());
    }
    catch (Exception e) {
      e.printStackTrace();
    }
    finally {
        if (is != null) is.close();
    }
  }
}

Click here to download the jars required by Tika.

(5) Use the JMimeMagic library to determine the file type.

Determining a file's type from its extension is not safe. The JMimeMagic library determines the type from the file's magic code, much like the file command on Linux systems. JMimeMagic is available under the LGPL licence and works by checking the magic header values of the file stream.

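// Method fragment: needs java.io.File and java.io.FileInputStream,
// plus JMimeMagic's Magic and MagicMatch classes.
public void check(File filename) throws Exception {
  Magic parser = new Magic();
  byte[] data = new byte[1024];
  // try-with-resources closes the stream even if read() fails
  try (FileInputStream is = new FileInputStream(filename)) {
    is.read(data); // only the leading bytes are needed for the magic lookup
  }
  MagicMatch match = parser.getMagicMatch(data);
  System.out.println(match.getMimeType());
}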


// snippet for JMimeMagic lib
//     http://sourceforge.net/projects/jmimemagic/

Magic parser = new Magic() ;
// getMagicMatch accepts Files or byte[],
// which is nice if you want to test streams
MagicMatch match = parser.getMagicMatch(new File("gumby.gif"));
System.out.println(match.getMimeType()) ;

Thanks to Jean-Marc Autexier and sygsix for the tip!

(6) Use the mime-util tool to determine the file type.

mime-util determines the file type either with the magic header technique or by checking the file extension.

import eu.medsea.mimeutil.MimeUtil;
import java.io.File;
import java.util.Collection;

public class Main {
    public static void main(String[] args) {
        MimeUtil.registerMimeDetector("eu.medsea.mimeutil.detector.MagicMimeMimeDetector");
        File f = new File ("c:/temp/mime/test.doc");
        Collection<?> mimeTypes = MimeUtil.getMimeTypes(f);
        System.out.println(mimeTypes);
        //  output : application/msword
    }
}

The nice thing about mime-util is that it is very lightweight: its only dependency is slf4j.

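(7) Use DROID to determine the file type.

DROID (Digital Record Object Identification) is a software tool that identifies file formats automatically.

DROID uses internal and external signatures to identify and report the specific file format version of digital files. These signatures are stored in an XML signature file generated from information recorded in the PRONOM technical registry. New and updated signatures are regularly added to PRONOM, and DROID can be configured to download updated signature files automatically from the PRONOM website via web services. It can be invoked from a Java Swing GUI or from a command-line interface.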

http://droid.sourceforge.net/wiki/index.php/Introduction

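(8) Use the Aperture framework to determine the file type.

Aperture is an open source library and framework for crawling and indexing information sources such as file systems, websites and mail boxes. The Aperture code consists of a number of related but independently usable parts: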

Crawling of information sources: file systems, websites, mail boxes
MIME type identification
Full-text and metadata extraction of various file formats
Opening of crawled resources

For each of these parts, a set of APIs has been developed and a number of implementations are provided.

http://aperture.wiki.sourceforge.net/Overview

file magic numbers

Magic numbers are the first few bytes of a file, which uniquely identify the type of file. This makes programming easier because complicated file structures need not be searched in order to identify the file type.

For example, a JPEG file starts with:

ffd8 ffe0 0010 4a46 4946 0001 0101 0047    ......JFIF.....G

ffd8 shows that it's a JPEG file, and ffe0 identifies a JFIF type structure. There is an ASCII encoding of "JFIF" which comes after a length code, but that is not necessary in order to identify the file. The first 4 bytes do that uniquely.

This page gives an ongoing list of file-type magic numbers.


Image files

File type	Typical extension	Hex digits (xx = variable)	ASCII (. = not an ASCII char)
Bitmap format	.bmp	42 4d	BM
FITS format	.fits	53 49 4d 50 4c 45	SIMPLE
GIF format	.gif	47 49 46 38	GIF8
Graphics Kernel System	.gks	47 4b 53 4d	GKSM
IRIS rgb format	.rgb	01 da	..
ITC (CMU WM) format	.itc	f1 00 40 bb	....
JPEG File Interchange Format	.jpg	ff d8 ff e0	....
NIFF (Navy TIFF)	.nif	49 49 4e 31	IIN1
PM format	.pm	56 49 45 57	VIEW
PNG format	.png	89 50 4e 47	.PNG
Postscript format	.[e]ps	25 21	%!
Sun Rasterfile	.ras	59 a6 6a 95	Y.j.
Targa format	.tga	xx xx xx	...
TIFF format (Motorola - big endian)	.tif	4d 4d 00 2a	MM.*
TIFF format (Intel - little endian)	.tif	49 49 2a 00	II*.
X11 Bitmap format	.xbm	xx xx
XCF Gimp file structure	.xcf	67 69 6d 70 20 78 63 66 20 76	gimp xcf
Xfig format	.fig	23 46 49 47	#FIG
XPM format	.xpm	2f 2a 20 58 50 4d 20 2a 2f	/* XPM */

Compressed files

File type	Typical extension	Hex digits (xx = variable)	ASCII (. = not an ASCII char)
Bzip	.bz	42 5a	BZ
Compress	.Z	1f 9d	..
gzip format	.gz	1f 8b	..
pkzip format	.zip	50 4b 03 04	PK..

Archive files

File type	Typical extension	Hex digits (xx = variable)	ASCII (. = not an ASCII char)
TAR (pre-POSIX)	.tar	xx xx	(a filename)
TAR (POSIX)	.tar	75 73 74 61 72	ustar (offset by 257 bytes)

Executable files

File type	Typical extension	Hex digits (xx = variable)	ASCII (. = not an ASCII char)
MS-DOS, OS/2 or MS Windows		4d 5a	MZ
Unix elf		7f 45 4c 46	.ELF

Miscellaneous files

File type	Hex digits (xx = variable)	ASCII (. = not an ASCII char)
pgp public ring	99 00	..
pgp security ring	95 01	..
pgp security ring	95 00	..
pgp encrypted data	a6 00	¦.
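A few of the signatures listed in the tables above can be checked with a short prefix comparison. The following is a minimal sketch, not a complete detector: the class MagicNumbers and its identify method are hypothetical, only signatures at offset 0 are covered, and the labels are the file type names from the tables rather than MIME types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MagicNumbers {
  // Signatures copied from the tables above: label -> leading bytes.
  private static final Map<String, int[]> SIGNATURES = new LinkedHashMap<>();
  static {
    SIGNATURES.put("GIF format", new int[] {0x47, 0x49, 0x46, 0x38});
    SIGNATURES.put("PNG format", new int[] {0x89, 0x50, 0x4e, 0x47});
    SIGNATURES.put("JPEG File Interchange Format", new int[] {0xff, 0xd8, 0xff, 0xe0});
    SIGNATURES.put("pkzip format", new int[] {0x50, 0x4b, 0x03, 0x04});
    SIGNATURES.put("gzip format", new int[] {0x1f, 0x8b});
    SIGNATURES.put("Unix elf", new int[] {0x7f, 0x45, 0x4c, 0x46});
  }

  // Returns the label of the first matching signature, or null if none match.
  public static String identify(byte[] header) {
    for (Map.Entry<String, int[]> entry : SIGNATURES.entrySet()) {
      if (startsWith(header, entry.getValue())) {
        return entry.getKey();
      }
    }
    return null;
  }

  private static boolean startsWith(byte[] data, int[] magic) {
    if (data.length < magic.length) {
      return false;
    }
    for (int i = 0; i < magic.length; i++) {
      // Mask to compare as unsigned, since Java bytes are signed.
      if ((data[i] & 0xff) != magic[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] gzipHead = {0x1f, (byte) 0x8b, 0x08};
    System.out.println(identify(gzipHead));
  }
}
```

Formats whose signature sits at a non-zero offset, such as POSIX TAR at byte 257, would need an offset stored next to each signature.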

Apache Module mod_mime_magic

Description:	Determines the MIME type of a file by looking at a few bytes of its contents
Status:	Extension
Module Identifier:	mime_magic_module
Source File:	mod_mime_magic.c

Summary

This module determines the MIME type of files in the same way the Unix file(1) command works: it looks at the first few bytes of the file. It is intended as a "second line of defense" for cases that mod_mime can't resolve.

This module is derived from a free version of the file(1) command for Unix, which uses "magic numbers" and other hints from a file's contents to figure out what the contents are. This module is active only if the magic file is specified by the MimeMagicFile directive.

Format of the Magic File

The contents of the file are plain ASCII text in 4-5 columns. Blank lines are allowed but ignored. Commented lines use a hash mark (#). The remaining lines are parsed for the following columns:

Column	Description
1	byte number to begin checking from; ">" indicates a dependency upon the previous non-">" line
2	type of data to match:

`byte`	single character
`short`	machine-order 16-bit integer
`long`	machine-order 32-bit integer
`string`	arbitrary-length string
`date`	long integer date (seconds since Unix epoch/1970)
`beshort`	big-endian 16-bit integer
`belong`	big-endian 32-bit integer
`bedate`	big-endian 32-bit integer date
`leshort`	little-endian 16-bit integer
`lelong`	little-endian 32-bit integer
`ledate`	little-endian 32-bit integer date

3	contents of data to match
4	MIME type if matched
5	MIME encoding if matched (optional)

For example, the following magic file lines would recognize some audio formats:

# Sun/NeXT audio data
0      string      .snd
>12    belong      1       audio/basic
>12    belong      2       audio/basic
>12    belong      3       audio/basic
>12    belong      4       audio/basic
>12    belong      5       audio/basic
>12    belong      6       audio/basic
>12    belong      7       audio/basic
>12    belong     23       audio/x-adpcm

Or these would recognize the difference between *.doc files containing Microsoft Word or FrameMaker documents. (These are incompatible file formats which use the same file suffix.)

# Frame
0  string  \<MakerFile        application/x-frame
0  string  \<MIFFile          application/x-frame
0  string  \<MakerDictionary  application/x-frame
0  string  \<MakerScreenFon   application/x-frame
0  string  \<MML              application/x-frame
0  string  \<Book             application/x-frame
0  string  \<Maker            application/x-frame

# MS-Word
0  string  \376\067\0\043            application/msword
0  string  \320\317\021\340\241\261  application/msword
0  string  \333\245-\0\0\0           application/msword

An optional MIME encoding can be included as a fifth column. For example, this can recognize gzipped files and set the encoding for them.

# gzip (GNU zip, not to be confused with
#       [Info-ZIP/PKWARE] zip archiver)

0  string  \037\213  application/octet-stream  x-gzip
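The column layout described above could be read along the following lines. This is a simplified sketch and not how mod_mime_magic itself parses the file: the class MagicLine is hypothetical, columns are split on whitespace (so match strings containing spaces are not handled), and only decimal offsets with a single leading ">" are accepted.

```java
public class MagicLine {
  public final boolean continuation; // true when column 1 starts with ">"
  public final int offset;           // column 1: byte number to check from
  public final String type;          // column 2: type of data to match
  public final String contents;      // column 3: contents of data to match
  public final String mimeType;      // column 4: may be null
  public final String encoding;      // column 5: may be null (optional)

  private MagicLine(boolean continuation, int offset, String type,
                    String contents, String mimeType, String encoding) {
    this.continuation = continuation;
    this.offset = offset;
    this.type = type;
    this.contents = contents;
    this.mimeType = mimeType;
    this.encoding = encoding;
  }

  // Returns null for blank lines and for comment lines starting with '#'.
  public static MagicLine parse(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty() || trimmed.startsWith("#")) {
      return null;
    }
    String[] cols = trimmed.split("\\s+");
    if (cols.length < 3) {
      throw new IllegalArgumentException("expected at least 3 columns: " + line);
    }
    boolean cont = cols[0].startsWith(">");
    int offset = Integer.parseInt(cont ? cols[0].substring(1) : cols[0]);
    String mime = cols.length > 3 ? cols[3] : null;
    String enc = cols.length > 4 ? cols[4] : null;
    return new MagicLine(cont, offset, cols[1], cols[2], mime, enc);
  }

  public static void main(String[] args) {
    MagicLine gz = parse("0  string  \\037\\213  application/octet-stream  x-gzip");
    System.out.println(gz.type + " at " + gz.offset + " -> " + gz.mimeType);
  }
}
```

A line such as "0 string .snd" has no MIME type of its own, which is why columns 4 and 5 are allowed to be absent here.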

Performance Issues

This module is not for every system. If your system is barely keeping up with its load or if you're performing a web server benchmark, you may not want to enable this because the processing is not free.

However, an effort was made to improve the performance of the original file(1) code to make it fit in a busy web server. It was designed for a server where there are thousands of users who publish their own documents. This is probably very common on intranets. Many times, it's helpful if the server can make more intelligent decisions about a file's contents than the file name allows ...even if just to reduce the "why doesn't my page work" calls when users improperly name their own files. You have to decide if the extra work suits your environment.

Notes

The following notes apply to the mod_mime_magic module and are included here for compliance with contributors' copyright restrictions that require their acknowledgment.

mod_mime_magic: MIME type lookup via file magic numbers
Copyright (c) 1996-1997 Cisco Systems, Inc.

This software was submitted by Cisco Systems to the Apache Group in July 1997. Future revisions and derivatives of this source code must acknowledge Cisco Systems as the original contributor of this module. All other licensing and usage conditions are those of the Apache Group.

Some of this code is derived from the free version of the file command originally posted to comp.sources.unix. Copyright info for that program is included below as required.

This software is not subject to any license of the American Telephone and Telegraph Company or of the Regents of the University of California.

Permission is granted to anyone to use this software for any purpose on any computer system, and to alter it and redistribute it freely, subject to the following restrictions:

The author is not responsible for the consequences of use of this software, no matter how awful, even if they arise from flaws in it.
The origin of this software must not be misrepresented, either by explicit claim or by omission. Since few users ever read sources, credits must appear in the documentation.
Altered versions must be plainly marked as such, and must not be misrepresented as being the original software. Since few users ever read sources, credits must appear in the documentation.
This notice may not be removed or altered.

For compliance with Mr Darwin's terms: this has been very significantly modified from the free "file" command.

all-in-one file for compilation convenience when moving from one version of Apache to the next.
Memory allocation is done through the Apache API's pool structure.
All functions have had necessary Apache API request or server structures passed to them where necessary to call other Apache API routines. (i.e., usually for logging, files, or memory allocation in itself or a called function.)
struct magic has been converted from an array to a single-ended linked list because it only grows one record at a time, it's only accessed sequentially, and the Apache API has no equivalent of realloc().
Functions have been changed to get their parameters from the server configuration instead of globals. (It should be reentrant now but has not been tested in a threaded environment.)
Places where it used to print results to stdout now saves them in a list where they're used to set the MIME type in the Apache request record.
Command-line flags have been removed since they will never be used here.

MimeMagicFile Directive

Description:	Enable MIME-type determination based on file contents using the specified magic file
Syntax:	`MimeMagicFile file-path`
Context:	server config, virtual host
Status:	Extension
Module:	mod_mime_magic

The MimeMagicFile directive can be used to enable this module; the default file is distributed at conf/magic. Non-rooted paths are relative to the ServerRoot. Virtual hosts will use the same file as the main server unless a more specific setting is used, in which case the more specific setting overrides the main server's file.

Example

MimeMagicFile conf/magic

License: Attribution, NonCommercial, NoDerivatives
