PDF Text Extractor for Java

Extract pure, raw, or plain text from PDF documents with Aspose.PDF Java Plugin

Text Extractor for Java

The Aspose.PDF Text Extractor for Java plugin extracts text from PDF documents in Java applications. It provides three operation modes, Pure, Raw, and Plain, which differ in how much of the page layout is preserved in the extracted text.

How to Extract Text from PDF via Java

  • Reference Aspose.PDF in your project
  • Set your license keys
  • Create instances of TextExtractor and TextExtractorOptions
  • Add the input PDF document using TextExtractorOptions.AddDataSource
  • Call TextExtractor.Process with the TextExtractorOptions instance and assign the result to a ResultContainer
  • Access the extracted text using ResultContainer.ResultCollection

Getting Started with PDF Text Extractor

Download the library files from the downloads page, or fetch the package from Maven to add Aspose.PDF directly to your project. You will need:

  • Java Runtime Environment 6 or higher
  • Maven
  • Development environment like IntelliJ IDEA

How to Extract Text from Multiple PDFs

  • Reference Aspose.PDF for Java in your project
  • Set your license keys
  • Create instances of TextExtractor and TextExtractorOptions
  • Add each input PDF document using TextExtractorOptions.AddDataSource
  • Call TextExtractor.Process with an instance of TextExtractorOptions as a parameter
  • Get the result into an instance of ResultContainer
  • Access extracted text using ResultContainer.ResultCollection

Text Extractor's Operation Modes

  • Pure mode extracts text with formatting applied: it takes the relative positions of text fragments into account and adds extra spaces to align the text to the width of the page.
  • Raw mode extracts text from the PDF file without applying any formatting.
  • Plain mode takes the relative positions of text fragments into account but, unlike Pure mode, does not add extra spaces.
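
To make the difference between the three modes concrete, here is a minimal sketch that does not use Aspose.PDF at all. It models a page as a list of text fragments, each with a line index and a column position, and renders them in three ways that mimic the descriptions above. The ModeSketch class, the Fragment type, and the raw, plain, and pure helpers are hypothetical names invented for this illustration. The plugin's actual algorithm is not documented here and may differ, for example in how Raw mode separates fragments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ModeSketch {

    /** Hypothetical text fragment: line index, start column (in characters), text. */
    static final class Fragment {
        final int line;
        final int column;
        final String text;

        Fragment(int line, int column, String text) {
            this.line = line;
            this.column = column;
            this.text = text;
        }
    }

    /** Raw-like: concatenate fragments in stored order; no layout is applied. */
    static String raw(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Groups fragments by line (ascending) and sorts each line by column. */
    private static Map<Integer, List<Fragment>> byLine(List<Fragment> fragments) {
        Map<Integer, List<Fragment>> lines = new TreeMap<>();
        for (Fragment f : fragments) {
            lines.computeIfAbsent(f.line, k -> new ArrayList<>()).add(f);
        }
        for (List<Fragment> line : lines.values()) {
            line.sort((a, b) -> Integer.compare(a.column, b.column));
        }
        return lines;
    }

    /** Plain-like: keep relative order, one space between fragments, no padding. */
    static String plain(List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        for (List<Fragment> line : byLine(fragments).values()) {
            List<String> parts = new ArrayList<>();
            for (Fragment f : line) {
                parts.add(f.text);
            }
            out.add(String.join(" ", parts));
        }
        return String.join("\n", out);
    }

    /** Pure-like: pad with spaces so each fragment starts at its own column. */
    static String pure(List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        for (List<Fragment> line : byLine(fragments).values()) {
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                while (sb.length() < f.column) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            out.add(sb.toString());
        }
        return String.join("\n", out);
    }

    /** A two-line, two-column sample page used by main. */
    static List<Fragment> sample() {
        return Arrays.asList(
                new Fragment(0, 0, "Item"),
                new Fragment(0, 12, "Price"),
                new Fragment(1, 0, "Pen"),
                new Fragment(1, 12, "1.50"));
    }

    public static void main(String[] args) {
        List<Fragment> page = sample();
        System.out.println("raw:\n" + raw(page));
        System.out.println("plain:\n" + plain(page));
        System.out.println("pure:\n" + pure(page));
    }
}
```

On the sample page, the pure-like output keeps the two columns aligned, the plain-like output keeps the line structure but collapses the gap to a single space, and the raw-like output keeps neither.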


Common Use Cases for Text Extraction

  • Extracting data for reporting and analysis
  • Converting PDF documents to text for searchability
  • Pre-processing PDFs for Natural Language Processing (NLP) tasks
  • Creating textual summaries for PDF documents

Tips for Effective Text Extraction

  • Ensure proper licensing before using the plugin in production
  • Validate input PDFs, as corrupted files may lead to extraction errors
  • Experiment with different operation modes to find the best fit for your needs
  • Use logging to capture extraction errors and performance metrics
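
The validation and logging tips can be sketched with the standard library alone. The example below checks that a file starts with the "%PDF-" signature that PDF files begin with, and logs a warning for files that fail the check. It also wraps any extraction step in a timer that logs the elapsed time. PdfInputCheck, hasPdfSignature, looksLikePdf, and timed are hypothetical names for this illustration, and the lambda in main is a placeholder for the real extraction call. The signature test is only a cheap pre-check, so a file that passes it can still be corrupted.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

class PdfInputCheck {
    private static final Logger LOG = Logger.getLogger(PdfInputCheck.class.getName());
    private static final byte[] SIGNATURE = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given bytes start with the "%PDF-" signature. */
    static boolean hasPdfSignature(byte[] head) {
        if (head == null || head.length < SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (head[i] != SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads the first bytes of the file and checks the signature; logs and returns false on failure. */
    static boolean looksLikePdf(Path file) {
        byte[] head = new byte[SIGNATURE.length];
        try (InputStream in = Files.newInputStream(file)) {
            int total = 0;
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                total += n;
            }
            if (total == head.length && hasPdfSignature(head)) {
                return true;
            }
            LOG.log(Level.WARNING, "Skipping {0}: no %PDF- signature", file);
            return false;
        } catch (IOException e) {
            LOG.log(Level.WARNING, "Skipping " + file + ": cannot read file", e);
            return false;
        }
    }

    /** Runs a step, logs how long it took, and returns its result. */
    static <T> T timed(String label, Supplier<T> step) {
        long start = System.nanoTime();
        try {
            return step.get();
        } finally {
            long ms = (System.nanoTime() - start) / 1_000_000;
            LOG.log(Level.INFO, "{0} took {1} ms", new Object[] {label, ms});
        }
    }

    public static void main(String[] args) {
        Path candidate = Paths.get(args.length > 0 ? args[0] : "input.pdf");
        if (looksLikePdf(candidate)) {
            // Placeholder: replace the lambda body with the real extraction call.
            String text = timed("extraction", () -> "extracted text goes here");
            System.out.println(text);
        }
    }
}
```

Keeping the signature test in a separate method that takes a byte array makes it usable on uploads or streams that never touch the file system.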

Frequently Asked Questions

What does Aspose.PDF Text Extractor for Java do?

Aspose.PDF Text Extractor for Java is a plugin for Java applications that extracts text from PDF documents. It offers three modes of operation (Pure, Raw, and Plain) and defaults to Raw mode. It can process multiple PDF files together, supports a range of input and output options, and can be customized by developers.

What is the difference between Aspose.PDF for Java & Aspose.PDF Text Extractor for Java?

Aspose.PDF for Java is a Java API for a wide range of PDF tasks, including document generation, compression, table creation, and advanced features such as importing and exporting PDF data. Aspose.PDF Text Extractor for Java is a specialized plugin focused solely on extracting text from PDF documents.

Is Aspose.PDF Text Extractor for Java limited to extracting text from PDFs?

Yes, PDF Text Extractor for Java is designed specifically for extracting text from PDF files. For other operations, you can use additional PDF plugins or the full capabilities of the Aspose.PDF library.

Does Aspose.PDF offer an online tool for PDF Text Extraction?

Yes, Aspose.PDF provides a free online PDF Text Parser tool for basic needs.

Where can I find Aspose.PDF Text Extraction examples in Java?

Examples are available on our landing pages for Extract Text from PDF for Java.