Compare commits

21 Commits
v1.0.2 ... main

Author SHA1 Message Date
Nemo bb122223fd Create FUNDING.yml 2022-05-27 07:22:57 +00:00
Vonter 2c386a3f2f Add README badges 2022-01-26 11:10:00 +05:30
Nemo e62284f3b0 update changelog 2021-12-31 13:15:36 +05:30
Nemo 55bfb6e26b Merge pull request #20 from captn3m0/python-upgrade 2021-12-30 17:29:41 +05:30
Nemo be985dd40b [dep] switch from html5 to html5lib 2021-12-30 17:18:03 +05:30
Nemo c614de7efc [ci] Run tests on python3.10 2021-12-30 17:02:30 +05:30
Nemo f617c6fde5 Add Installation instructions 2021-07-21 18:17:00 +00:00
    Closes #19
Nemo 5167dd4c8a Merge pull request #18 from captn3m0/old-python 2021-07-16 17:07:24 +05:30
    Support older python releases
Nemo dd8129aa2d Fix for older Python 2021-07-16 17:05:27 +05:30
Nemo 3ea18ff01b [tests] Add tests for argument parser 2021-07-16 16:57:09 +05:30
Nemo 2db41250f6 docs: Update docs to mention remote URL support 2021-07-05 13:34:45 +05:30
Nemo cc2a58bddc Add Tests (#13) 2021-07-04 07:27:18 +00:00
    Basic functional tests that cover 90% of the usecases.
    Doesn't cover zoomlevel, remote fetch yet.
Vonter af4752bee1 Merge pull request #11 from captn3m0/feature/external_url 2021-06-27 20:51:10 +05:30
    Add basic implementation of external URL fetching of PDFs
Vonter 052060d256 Fix setup.cfg 2021-06-27 17:57:38 +05:30
    Included validators
Vonter e70166efc2 Fix logged filename for locally cached file 2021-06-27 17:43:09 +05:30
Vonter 31faa1a36c Add external URL fetching of PDFs 2021-06-27 17:33:49 +05:30
    Also changed import order according to PEP8
Vonter ebc9c1e0cf Update README.rst 2021-06-27 00:15:26 +05:30
    Fixed attribute table
Vonter 1324c2e4aa Merge pull request #10 from Vonter/feature/page_filter 2021-06-27 00:12:17 +05:30
    Add PDF page selection/filter
Vonter 487e1002d4 Make defaultEnd correspond to absolute page number 2021-06-27 00:03:57 +05:30
Vonter 096b1f6be2 Add PDF page selection/filter 2021-06-26 22:56:38 +05:30
Nemo 4f505efde2 Add link to wiki 2021-06-26 18:05:47 +05:30
18 changed files with 307 additions and 57 deletions

3
.github/FUNDING.yml vendored Normal file
@ -0,0 +1,3 @@
ko_fi: captn3m0
liberapay: captn3m0
github: captn3m0

29
.github/workflows/tests.yml vendored Normal file
@ -0,0 +1,29 @@
name: Run Tests
on: push
jobs:
  python:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python: ["3.7", "3.8", "3.9", "3.10"]
    env:
      PYTHON_VERSION: ${{matrix.python}}
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{matrix.python}}
        uses: actions/setup-python@v2
        with:
          python-version: ${{matrix.python}}
      - name: Install deps
        run: |
          python -m pip install --upgrade pip
          pip install -e .[testing]
      - name: Run pytest
        run: |
          pytest --cache-clear --cov=./ --cov-report=xml --cov-report=html
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v1
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          files: ./coverage.xml
          env_vars: RUNNER_OS,PYTHON_VERSION,CI,GITHUB_SHA,RUNNER_OS,GITHUB_RUN_ID

@ -2,6 +2,20 @@
Changelog
=========
Version 1.0.4
=============
- Switched from `html5` to `html5lib` as a dependency, since the former is unmaintained
- Python 3.10 is now supported
- Python 3.6 is no longer supported
Version 1.0.3
=============
- Added tests and code coverage
- PDFs can be directly fetched from Remote URLs
- PDFs can be filtered to have start and end pages
- Support for Python 3.6-3.8
- Removed --cleanup argument, since that is default
Version 1.0.2
=============
- Adds support for rotating PDFs

@ -2,8 +2,35 @@
pystitcher
==========
.. image:: https://img.shields.io/pypi/v/pystitcher
:target: https://pypi.org/project/pystitcher/
:alt: PyPI Version
.. image:: https://img.shields.io/pypi/l/pystitcher
:target: LICENSE.txt
:alt: Repository License
.. image:: https://img.shields.io/github/checks-status/captn3m0/pystitcher/main
:target: https://github.com/captn3m0/pystitcher/actions?query=branch%3Amain
:alt: GitHub branch checks status
.. image:: https://img.shields.io/codecov/c/gh/captn3m0/pystitcher
:target: https://app.codecov.io/gh/captn3m0/pystitcher/
:alt: Codecov
|
pystitcher stitches your PDF files together, generating nice customizable bookmarks for you using a declarative input in the form of a markdown file. It is written in pure python and uses `PyPDF3 <https://pypi.org/project/PyPDF3/>`_ for reading and writing PDF files.
Installation
============
You can install it easily using `pipx <https://pypa.github.io/pipx/>`_::
pipx install pystitcher
The Wiki has `Alternative Installation Instructions <https://github.com/captn3m0/pystitcher/wiki/Installation>`_.
Description
===========
@ -38,8 +65,8 @@ Given this input::
# The Bills
- [Personal Data Protection Bill, 2019](1.a.pdf)
- [Personal Data Protection Bill, 2018](1.b.pdf)
- [Personal Data Protection Bill, 2019](https://example.com/2019-bill.pdf)
- [Personal Data Protection Bill, 2018](https://example.com/2018-bill.pdf)
# Other key reading material
@ -88,15 +115,29 @@ Configuration options can be specified with Meta data at the top of the file.
| | for more details. |
+---------------------+--------------------------------------------------------------------------+
Additionally, PDF links specified in markdown can have attributes to alter the PDFs before merging::
Additionally, PDF links specified in markdown can have attributes to alter the PDFs before merging. The below attribute will rotate the second PDF file by 90 degrees clockwise before merging::
[Part 1](1.pdf)
[Part 2](2.pdf){: rotate="90"}
The above will rotate the second PDF file by 90 degrees clockwise before merging. List of attributes:
And the below attribute will merge only pages 2 to 5, both inclusive, from the second PDF file::
+---------------------+---------------------------------------------+
| Attribute | Notes |
+=====================+=============================================+
| rotate | Rotate the PDF. Valid values are 90,180,270 |
+---------------------+---------------------------------------------+
[Part 1](1.pdf)
[Part 2](2.pdf){: start=2 end=5}
The list of available attributes are:
+---------------------+-----------------------------------------------+
| Attribute | Notes |
+=====================+===============================================+
| rotate | Rotate the PDF. Valid values are 90, 180, 270 |
+---------------------+-----------------------------------------------+
| start | Start page number for PDF page selection |
+---------------------+-----------------------------------------------+
| end | End page number for PDF page selection |
+---------------------+-----------------------------------------------+
Documentation
=============
Additional documentation is maintained on the `project wiki <https://github.com/captn3m0/pystitcher/wiki>`_ on GitHub.
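
As a rough illustration of the `start`/`end` semantics described in the README hunk above (1-indexed, both ends inclusive), the sketch below computes which zero-based page indices would be copied. `select_pages` is a hypothetical helper and not part of pystitcher; it mirrors the `range(start, end + 1)` loop and the `page - 1` lookup in the `_merge` hunk, and the bounds check is an added assumption that the diff itself does not show.

```python
from typing import List, Optional


def select_pages(num_pages: int, start: int = 1, end: Optional[int] = None) -> List[int]:
    """Return zero-based indices for the 1-indexed, inclusive range start..end.

    Hypothetical helper for illustration; an `end` of None means "last page".
    """
    if end is None:
        end = num_pages
    # Validation is an assumption of this sketch, not behaviour shown in the diff.
    if not 1 <= start <= end <= num_pages:
        raise ValueError("expected 1 <= start <= end <= num_pages")
    return [page - 1 for page in range(start, end + 1)]


# {: start=2 end=5} on a 6-page PDF keeps four pages
selected = select_pages(6, start=2, end=5)
print(selected)  # [1, 2, 3, 4]
print(len(selected) == (5 - 2) + 1)  # True
```

The final comparison corresponds to the `(end - start) + 1` page accounting used when advancing `currentPage`.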

@ -36,16 +36,18 @@ package_dir =
=src
# Require a min/specific Python version (comma-separated conditions)
python_requires = >=3.6
python_requires = >=3.7
# PyPDF3: Read and write PDF files
# Markdown: Render input markdown file to HTML
# html5: Parse HTML file to generate bookmarks
# html5lib: Parse HTML file to generate bookmarks
# validators: Validate URL for fetching external PDF
install_requires =
importlib-metadata; python_version<"3.8"
PyPDF3>=1.0.4
Markdown>=3.3.4
html5>=0.0.9
html5lib>=1.1
validators>=0.18.1
[options.packages.find]
where = src
@ -80,9 +82,9 @@ console_scripts =
# in order to write a coverage file that can be read by Jenkins.
# CAUTION: --cov flags may prohibit setting breakpoints while debugging.
# Comment those flags to avoid this py.test issue.
addopts =
--cov pystitcher --cov-report term-missing
--verbose
addopts = --verbose
# --cov pystitcher --cov-report term-missing
norecursedirs =
dist
build

@ -52,9 +52,10 @@ def parse_args(args):
)
parser.add_argument(
'--cleanup',
action=argparse.BooleanOptionalAction,
'--no-cleanup',
action='store_false',
default=True,
dest='cleanup',
help="Delete temporary files"
)
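
`argparse.BooleanOptionalAction` only exists from Python 3.9 onward, which is presumably why this hunk replaces it (the commit message says only "Fix for older Python"). A minimal sketch of the replacement pattern follows, using a throwaway parser rather than pystitcher's real `parse_args`; the `prog` name and help text are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Stand-in parser for illustration; only the --no-cleanup flag is modelled.
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument(
        "--no-cleanup",
        action="store_false",  # passing the flag stores False...
        default=True,          # ...otherwise cleanup stays enabled
        dest="cleanup",
        help="Keep temporary files",
    )
    return parser


parser = build_parser()
print(parser.parse_args([]).cleanup)                # True
print(parser.parse_args(["--no-cleanup"]).cleanup)  # False
```

The trade-off is that the positive `--cleanup` form is no longer accepted, which matches the changelog entry about removing that argument.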

@ -1,12 +1,17 @@
import os
import markdown
from .bookmark import Bookmark
import logging
import shutil
import tempfile
import urllib.request
import validators
import html5lib
import markdown
from PyPDF3 import PdfFileWriter, PdfFileReader
from PyPDF3.generic import FloatObject
from pystitcher import __version__
import tempfile
import logging
from .bookmark import Bookmark
_logger = logging.getLogger(__name__)
@ -24,6 +29,10 @@ class Stitcher:
DEFAULT_FIT = '/FitV'
# Do not rotate by default
DEFAULT_ROTATE = 0
# Start at page 1 by default
DEFAULT_START = 1
# End at the final page by default
DEFAULT_END = None
# TODO: This is a hack
os.chdir(self.dir)
@ -34,11 +43,27 @@ class Stitcher:
self.attributes = md.Meta
self.defaultFit = self._getAttribute('fit', DEFAULT_FIT)
self.defaultRotate = self._getAttribute('rotate', DEFAULT_ROTATE)
self.defaultStart = self._getAttribute('start', DEFAULT_START)
self.defaultEnd = self._getAttribute('end', DEFAULT_END)
document = html5lib.parseFragment(html, namespaceHTMLElements=False)
for e in document.iter():
self.iter(e)
"""
Check if file has been cached locally and if
not cached, download from provided URL. Return
download filename
"""
def _cacheURL(self, url):
if not os.path.exists(os.path.basename(url)):
_logger.info("Downloading PDF from remote URL %s", url)
with urllib.request.urlopen(url) as response, open(os.path.basename(url), 'wb') as downloadedFile:
shutil.copyfileobj(response, downloadedFile)
else:
_logger.info("Locally cached PDF found at %s", os.path.basename(url))
return os.path.basename(url)
"""
Get the number of pages in a PDF file
"""
@ -92,11 +117,17 @@ class Stitcher:
self.currentLevel = 3
elif(tag =='a'):
file = element.attrib.get('href')
rotate = element.attrib.get('rotate', self.defaultRotate)
if(validators.url(file)):
file = self._cacheURL(file)
fit = element.attrib.get('fit', self.defaultFit)
rotate = int(element.attrib.get('rotate', self.defaultRotate))
start = int(element.attrib.get('start', self.defaultStart))
end = int(element.attrib.get('end', self._get_pdf_number_of_pages(file)
if self.defaultEnd is None else self.defaultEnd))
filters = (rotate, start, end)
b = Bookmark(self.currentPage, element.text, self.currentLevel+1, fit)
self.files.append((file, self.currentPage, rotate))
self.currentPage += self._get_pdf_number_of_pages(file)
self.files.append((file, self.currentPage, filters))
self.currentPage += (end - start) + 1
if b:
self.bookmarks.append(b)
@ -133,7 +164,7 @@ class Stitcher:
self.bookmarks = bookmarks
"""
Gets the last bookmkark level at a given page number
Gets the last bookmark level at a given page number
on the combined PDF
"""
def _get_level_from_page_number(self, page):
@ -190,13 +221,14 @@ class Stitcher:
"""
def _merge(self, output):
writer = PdfFileWriter()
for (inputFile,startPage,rotate) in self.files:
for (inputFile,startPage,filters) in self.files:
assert os.path.isfile(inputFile), ERROR_PATH.format(inputFile)
reader = PdfFileReader(open(inputFile, 'rb'))
# Recursively iterate through the old bookmarks
self._iterate_old_bookmarks(reader, startPage, reader.getOutlines())
for page in range(1, reader.getNumPages()+1):
writer.addPage(reader.getPage(page - 1).rotateClockwise(int(rotate)))
rotate, start, end = filters
for page in range(start, end + 1):
writer.addPage(reader.getPage(page - 1).rotateClockwise(rotate))
writer.write(output)
output.close()
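
The `_cacheURL` method added in this file keys its cache on `os.path.basename(url)`. The sketch below reproduces that check in isolation so the consequence is easy to see: two different URLs that end in the same file name share one cached file. `cache_url`, `fake_fetch`, and the example URLs are hypothetical; the injected `fetch` callable and the explicit `directory` argument are assumptions made so the example runs without network access, whereas the real method calls `urllib.request.urlopen` and writes into the current working directory.

```python
import os
import tempfile
from typing import Callable


def cache_url(url: str, fetch: Callable[[str], bytes], directory: str) -> str:
    """Download url into directory unless a file with the same basename exists."""
    target = os.path.join(directory, os.path.basename(url))
    if not os.path.exists(target):
        with open(target, "wb") as downloaded:
            downloaded.write(fetch(url))
    return target


calls = []


def fake_fetch(url: str) -> bytes:
    # Records each "download" instead of touching the network.
    calls.append(url)
    return b"%PDF-1.4 placeholder"


with tempfile.TemporaryDirectory() as workdir:
    first = cache_url("https://example.com/a/report.pdf", fake_fetch, workdir)
    # Different URL, same basename: treated as already cached.
    second = cache_url("https://example.com/b/report.pdf", fake_fetch, workdir)
    print(first == second)  # True
    print(len(calls))       # 1
```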

@ -1,5 +1,6 @@
existing_bookmarks: remove
author: Wiki, the Cat
title: Super Jelly Book
subject: A book about adventures of Wiki, the cat.
keywords: wiki,potato,jelly
# Super Potato Book

@ -0,0 +1,17 @@
existing_bookmarks: remove
author: Wiki, the Cat
subject: A book about adventures of Wiki, the cat.
keywords: wiki,potato,jelly
# Super Potato Book
# Volume 1
[Part 1](1.pdf)
# Volume 2
[Part 2](https://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf)
# Volume 3
[Part 3](https://juventudedesporto.cplp.org/files/sample-pdf_9359.pdf)

@ -1,5 +1,4 @@
existing_bookmarks: flatten
title: Super Jelly Book
# Super Potato Book

10
tests/book-headings.md Normal file
@ -0,0 +1,10 @@
# Heading 1
[Part 1](1.pdf)
## Heading 2
[Part 2](1.pdf)
### Heading 3
[Part 3](1.pdf)

18
tests/book-page-select.md Normal file
@ -0,0 +1,18 @@
existing_bookmarks: keep
# Super Potato Book
# Volume 1
[Part 1](1.pdf){: start=1 end=2}
# Volume 2
[Part 2](2.pdf){: start=2}
# Volume 3
[Part 3](1.pdf){: end=2}
# Volume 4
[Part 4](2.pdf){: start=1 end=3 rotate="90"}

@ -1,7 +1,4 @@
existing_bookmarks: remove
author: Wiki, the Cat
subject: A book about adventures of Wiki, the cat.
keywords: wiki,potato,jelly
# Super Potato Book
# Volume 1

15
tests/test_cli.py Normal file
@ -0,0 +1,15 @@
from pystitcher.skeleton import parse_args
import logging

def test_default_args():
    args = parse_args(['tests/book-clean.md', 'o.pdf'])
    assert args.loglevel == None
    assert args.cleanup == True

def test_loglevel():
    args = parse_args(['-v', 'tests/book-clean.md', 'o.pdf'])
    assert args.loglevel == logging.INFO

def test_cleanup():
    args = parse_args(['--no-cleanup', 'tests/book-clean.md', 'o.pdf'])
    assert args.cleanup == False

96
tests/test_integration.py Normal file
@ -0,0 +1,96 @@
import os
import io
import PyPDF3
from pystitcher.stitcher import Stitcher
from pystitcher import __version__
import pytest
from contextlib import redirect_stdout
ROOT_DIR = os.path.dirname(os.path.abspath(__file__)) + "/../"
"""
Fixtures for the integration tests. Each test is a tuple consisting of 4 things:
- input name (used as book-{name}.md)
- total expected page count
- A dictionary of expected metadata. Leave empty if nothing is set
- A flattened list of expected bookmarks, with each bookmark as a tuple containing:
    - Title
    - Destination Page Number
    - Bookmark Level (default = 0)
Each of the above 4 is passed to test_book as an argument
"""
TEST_DATA = [
("clean",6, {'Author': 'Wiki, the Cat', 'Title': 'Super Jelly Book', 'Subject': 'A book about adventures of Wiki, the cat.', 'Keywords': 'wiki,potato,jelly'}, [('Super Potato Book', 0, 0), ('Volume 1', 0, 0), ('Part 1', 0, 1), ('Volume 2', 3, 0), ('Part 2', 3, 1)]),
("keep",6, {'Title': 'Super Potato Book'}, [('Super Potato Book', 0, 0), ('Volume 1', 0, 0), ('Part 1', 0, 1), ('Chapter 1', 0, 2), ('Chapter 2', 1, 2), ('Scene 1', 1, 3), ('Scene 2', 2, 3), ('Volume 2', 3, 0), ('Part 3', 3, 1), ('Chapter 3', 3, 2), ('Chapter 4', 4, 2), ('Scene 3', 4, 3), ('Scene 4', 5, 3)]),
("flatten", 6, {}, [('Super Potato Book', 0, 0), ('Volume 1', 0, 0), ('Part 1', 0, 1), ('Chapter 1', 0, 2), ('Chapter 2', 1, 2), ('Scene 1', 1, 2), ('Scene 2', 2, 2), ('Volume 2', 3, 0), ('Part 3', 3, 1), ('Chapter 3', 3, 2), ('Chapter 4', 4, 2), ('Scene 3', 4, 2), ('Scene 4', 5, 2)]),
("rotate", 9, {}, [('Super Potato Book', 0, 0), ('Volume 1', 0, 0), ('Part 1', 0, 1), ('Volume 2', 3, 0), ('Part 2', 3, 1), ('Volume 3', 6, 0), ('Part 3', 6, 1)]),
("min",3, {}, [('Part 1', 0, 0), ('Chapter 1', 0, 1), ('Chapter 2', 1, 1), ('Scene 1', 1, 2), ('Scene 2', 2, 2)]),
("page-select", 9, {}, [('Super Potato Book', 0, 0), ('Volume 1', 0, 0), ('Part 1', 0, 1), ('Chapter 1', 0, 2), ('Chapter 2', 1, 2), ('Scene 1', 1, 3), ('Volume 2', 2, 0), ('Part 2', 2, 1), ('Scene 2', 2, 2), ('Chapter 3', 2, 2), ('Chapter 4', 3, 2), ('Scene 3', 3, 3), ('Volume 3', 4, 0), ('Part 3', 4, 1), ('Scene 4', 4, 2), ('Chapter 1', 4, 2), ('Chapter 2', 5, 2), ('Scene 1', 5, 3), ('Volume 4', 6, 0), ('Part 4', 6, 1), ('Scene 2', 6, 2), ('Chapter 3', 6, 2), ('Chapter 4', 7, 2), ('Scene 3', 7, 3), ('Scene 4', 8, 3)]),
("headings", 9, {'Title': 'Heading 1'}, [('Heading 1', 0, 0), ('Part 1', 0, 1), ('Heading 2', 3, 1), ('Part 2', 3, 2), ('Heading 3', 6, 2), ('Part 3', 6, 3)])
]
def pdf_name(name):
    return "tests/%s.pdf" % name

def render(name, cleanup=True):
    input_file = open("tests/book-%s.md" % name, 'r')
    output_file = "%s.pdf" % name
    stitcher = Stitcher(input_file)
    stitcher.generate(output_file, cleanup)
    # Switch back to main directory
    os.chdir(ROOT_DIR)
    return pdf_name(name)

def flatten_bookmarks(bookmarks, level=0):
    """Given a list, possibly nested to any level, return it flattened."""
    output = []
    for destination in bookmarks:
        if type(destination) == type([]):
            output.extend(flatten_bookmarks(destination, level+1))
        else:
            output.append((destination, level))
    return output

def get_all_bookmarks(pdf):
    """ Returns a list of all bookmarks with title, page number, and level in a PDF file"""
    bookmarks = flatten_bookmarks(pdf.getOutlines())
    return [(d[0]['/Title'], pdf.getDestinationPageNumber(d[0]), d[1]) for d in bookmarks]

@pytest.mark.parametrize("name,pages,metadata,bookmarks", TEST_DATA)
def test_book(name, pages, metadata, bookmarks):
    output_file = render(name)
    pdf = PyPDF3.PdfFileReader(output_file)
    assert pages == pdf.getNumPages()
    assert bookmarks == get_all_bookmarks(pdf)
    info = pdf.getDocumentInfo()
    identity = "pystitcher/%s" % __version__
    assert identity == info['/Producer']
    assert identity == info['/Creator']
    for key in metadata:
        assert info["/%s" % key] == metadata[key]

def test_rotation():
    """ Validates the book-rotate.pdf with pages rotated."""
    output_file = render("rotate")
    pdf = PyPDF3.PdfFileReader(output_file)
    # Note that inputs to getPage are 0-indexed
    assert 90 == pdf.getPage(3)['/Rotate']
    assert 90 == pdf.getPage(4)['/Rotate']
    assert 90 == pdf.getPage(5)['/Rotate']
    assert 180 == pdf.getPage(6)['/Rotate']
    assert 180 == pdf.getPage(7)['/Rotate']
    assert 180 == pdf.getPage(8)['/Rotate']

def test_cleanup_disabled():
    f = io.StringIO()
    with redirect_stdout(f):
        output_file = render("min", False)
    temp_filename = f.getvalue()[29:-1]
    assert os.path.exists(temp_filename)
    pdf = PyPDF3.PdfFileReader(temp_filename)
    assert 3 == pdf.getNumPages()
    assert [] == pdf.getOutlines()
    # Clean it up manually to avoid cluttering
    os.remove(temp_filename)
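
To make the bookmark fixtures easier to read: `flatten_bookmarks` treats any nested list as the children of the preceding entries and converts nesting depth into a level. The self-contained sketch below restates that logic with plain strings standing in for PyPDF3 destination objects; the sample outline is invented for illustration and uses `isinstance` in place of the `type(...) == type([])` comparison.

```python
def flatten_bookmarks(bookmarks, level=0):
    """Flatten a nested outline into (entry, level) tuples, depth-first."""
    output = []
    for destination in bookmarks:
        if isinstance(destination, list):
            output.extend(flatten_bookmarks(destination, level + 1))
        else:
            output.append((destination, level))
    return output


# Strings stand in for destination objects; the nesting is made up.
outline = ["Part 1", ["Chapter 1", "Chapter 2", ["Scene 1", "Scene 2"]]]
for title, level in flatten_bookmarks(outline):
    print("  " * level + title)
```

The resulting levels (0, 1, 1, 2, 2) have the same shape as the third element of each tuple in the `"min"` fixture.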

@ -1,25 +0,0 @@
import pytest
from pystitcher.skeleton import fib, main

__author__ = "Nemo"
__copyright__ = "Nemo"
__license__ = "MIT"

def test_fib():
    """API Tests"""
    assert fib(1) == 1
    assert fib(2) == 1
    assert fib(7) == 13
    with pytest.raises(AssertionError):
        fib(-10)

def test_main(capsys):
    """CLI Tests"""
    # capsys is a pytest fixture that allows asserts agains stdout/stderr
    # https://docs.pytest.org/en/stable/capture.html
    main(["7"])
    captured = capsys.readouterr()
    assert "The 7-th Fibonacci number is 13" in captured.out