Compare commits

...

54 Commits

Author SHA1 Message Date
Nemo 2a35f3c68c [dep] Crystal Upgrade to 1.10.1
The Crystal Debian repo has moved, so we shift as well.
Debian 10 is still supported, so use it for now
2023-10-31 23:44:52 +05:30
Nemo dc43331609 [dep] Dependency Upgrade
Tested against 1.10.1
2023-10-31 23:42:35 +05:30
Nemo 24f4bb10c8 Create FUNDING.yml 2022-05-30 14:50:06 +05:30
Nemo 5fd0056d77 Dependency and version bump 2021-06-04 13:56:51 +05:30
Nemo 1e57857a4e Version Bump (1.3.0) 2020-07-01 18:29:44 +05:30
Nemo ba0a47038d Remove input-pdf from README and help 2020-07-01 18:29:22 +05:30
Nemo a4f5c03912 Merge pull request #8 from captn3m0/journal-support
Adds Journal Support
2020-07-01 18:27:39 +05:30
Nemo a05a1253db Keep going with next issue 2020-07-01 18:26:48 +05:30
Nemo 03fccde754 Adds support for final journal downloads 2020-06-30 18:36:01 +05:30
Nemo 3a2d45fb6e Adds a skip-open-access flag 2020-06-30 18:09:38 +05:30
Nemo 62e6a21c84 Finishes support for downloading complete issues 2020-06-30 17:36:44 +05:30
Nemo 38db0dd000 Adds tests for page detection 2020-06-30 16:50:49 +05:30
Nemo 919c8ac43f Fixes parser for issue HTML
This also adds .journal_title as an attribute to the Issue object
2020-06-30 15:19:12 +05:30
Nemo 870ed3080d Modular code in fetch to support both chapters and articles 2020-06-30 14:47:51 +05:30
Nemo f04e9b799e Removes input_pdf and initial work on article download 2020-06-30 14:18:19 +05:30
Nemo 04a2fe52ec Minor fixes, parse contents for issues 2020-06-30 14:08:28 +05:30
Nemo aa392eaa64 Adds support for parsing title to volume/number/date of a journal issue 2020-06-16 19:27:11 +05:30
Nemo c01e071328 [make] Adds tests to Makefile 2020-06-16 19:13:52 +05:30
Nemo 3e56efed52 Parses summary for issueS 2020-06-16 18:52:29 +05:30
Nemo 7b48731afe Parse title and publisher for issues 2020-06-16 18:52:29 +05:30
Nemo 6b278531fd Infobox is parsing for an issue now 2020-06-16 18:52:29 +05:30
Nemo f11f64b9d5 Adds webmock 2020-06-16 18:52:29 +05:30
Nemo ff225b12c6 Fix filenames with double-quotes 2020-06-16 18:52:29 +05:30
Nemo 4a358d0cb0 Journal parser now parses all issues 2020-06-16 18:52:29 +05:30
Nemo d8702b2fcb Initial work on parsing the journal page 2020-06-16 18:52:29 +05:30
Nemo fcc4f0c48b Clear out the Producer/Creator on the PDF 2020-06-16 18:52:28 +05:30
Nemo a23bd52ffa Fix Crystal and DL3008 issues 2020-05-14 03:40:42 +05:30
Nemo 3de4053037 [docker] Remove pinned versions 2020-05-14 01:31:38 +05:30
Nemo 487b222d79 Adds support for --dont-strip-first-page 2020-05-14 01:04:15 +05:30
Nemo d245538e33 Version bump 2020-04-22 18:32:37 +05:30
Nemo c3722430e1 Adds a check for rate-limit 2020-04-22 18:31:37 +05:30
Nemo a2db89ddf7 [docs] Fix docker badges 2020-04-21 19:34:39 +05:30
Prad Nelluru 5e5158fe1c Don't backoff for more than 256 seconds (~4 min) (#13) 2020-04-21 17:56:25 +05:30
Nemo ebf1b57e22 Merge pull request #12 from pradn/better-errors
Improve error handling
2020-04-20 03:23:24 +05:30
Prad Nelluru 2206c41228 Use response.body, not response.body_io, which is nil when you pass in HTTPClient for some reason. 2020-04-19 17:50:06 -04:00
Prad Nelluru 4e435dd3ab Add 60s timeout to downloads. Do backoff for all errors. 2020-04-19 17:44:21 -04:00
Prad Nelluru 9659c0ef5e Trim chapter titles to ensure bookmarks are valid in PDF (#11) 2020-04-20 02:03:30 +05:30
Prad Nelluru 762164e223 more descriptive error messages 2020-04-19 15:18:05 -04:00
Prad Nelluru 77201bda85 Fix download issue - revert to using body_io 2020-04-19 15:00:59 -04:00
Prad Nelluru db2d86c1a8 Also add exception message to top-level rescue 2020-04-19 14:49:41 -04:00
Prad Nelluru 1d2f53bad0 forgot to git-add new error files 2020-04-19 14:46:26 -04:00
Prad Nelluru 26d96d3f7d Remove assert that temp path be tmp. It has been changed to an actual random temp path so we can't test for it easily. 2020-04-19 02:40:42 -04:00
Prad Nelluru 5d9d951c9a Write backtrace in top-level rescue blocks. 2020-04-19 02:24:09 -04:00
Prad Nelluru 483f838d24 Report pdftk and download errors. Add exponential backoff to downloading after download failures. Add top-level rescue block to improve forward progress. 2020-04-19 01:58:20 -04:00
Nemo d52b06377d Version bump (1.1.2) 2020-04-05 18:58:28 +05:30
Nemo b7aad7a3c2 Add link to download message 2020-04-05 18:58:02 +05:30
Nemo 380f1f03f8 Put URL when skipping a file 2020-04-05 18:57:24 +05:30
Nemo 61005ab405 fix docker image to edge 2020-04-05 04:41:49 +05:30
Nemo 5ce11df239 [docker] Install Make 2020-04-05 03:08:57 +05:30
Nemo 449be5e554 Version bump 2020-04-05 02:55:35 +05:30
Nemo c08b8b7284 Show version in help 2020-04-05 02:55:19 +05:30
Nemo 1d95cce3f8 Catch another PDF error 2020-04-05 02:14:50 +05:30
Nemo aec6d853b3 Use latest release tag in docs 2020-04-04 03:53:03 +05:30
Nemo 78043e81a2 [docs] Adds list of docker images 2020-04-04 03:51:18 +05:30
31 changed files with 5522 additions and 132 deletions
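
The backoff behaviour named in the commit log above (exponential backoff after download failures, never waiting more than 256 seconds) can be sketched as follows. This is an illustration only, not the project's code: `MAX_BACKOFF_SECS`, `backoff_delay`, the 1-second starting delay, and the doubling factor are assumptions; only the 256-second cap comes from the commit messages. The sketch uses only syntax that Crystal shares with Ruby.

```ruby
# Hypothetical names; only the 256-second cap comes from the commit log.
MAX_BACKOFF_SECS = 256

# Seconds to wait before retry number `attempt` (0-based).
# Assumes the delay starts at 1 second and doubles each time, up to the cap.
def backoff_delay(attempt)
  [2 ** attempt, MAX_BACKOFF_SECS].min
end

# Print the delays for the first ten retries.
delays = (0..9).map { |attempt| backoff_delay(attempt) }
puts delays.join(", ")
```

Under these assumptions the cap is reached on the ninth retry, after which every further wait stays at 256 seconds.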

3
.github/FUNDING.yml vendored Normal file

@ -0,0 +1,3 @@
ko_fi: captn3m0
liberapay: captn3m0
github: captn3m0

View File

@ -12,9 +12,9 @@ install:
script:
- crystal spec
- crystal tool format --check
- - git ls-files --exclude='Dockerfile*' --ignored | xargs --max-lines=1 ${HADOLINT}
+ - git ls-files --exclude='Dockerfile*' --ignored | xargs --max-lines=1 ${HADOLINT} --ignore DL3008
addons:
apt:
packages:
- - pdftk
+ - pdftk

View File

@ -5,29 +5,33 @@ WORKDIR /build
COPY . .
# Add the key for the crystal debian repo
- ADD https://keybase.io/crystal/pgp_keys.asc /tmp/crystal.gpg
+ ADD https://download.opensuse.org/repositories/devel:/languages:/crystal/Debian_10/Release.key /tmp/crystal.key
# See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=863199 for why mkdir is needed
RUN mkdir -p /usr/share/man/man1 && \
apt-get update && \
apt-get install --yes --no-install-recommends \
# Install gnupg for the apt-key operation
- gnupg=2.2.12-1+deb10u1 \
+ gnupg \
# libssl for faster TLS in Crystal
- libssl-dev=1.1.1d-0+deb10u2 \
+ libssl-dev \
# pdftk as a dependency for muse-dl
pdftk=2.02-5 \
# ca-certificates for talking to crystal-lang.org
- ca-certificates=20190110 \
+ ca-certificates \
# git to let shards install happen
- git=1:2.20.1-2+deb10u1 \
+ git \
+ # needed by myhtml crystal shard
+ make \
# build --release
- zlib1g-dev=1:1.2.11.dfsg-1 && \
+ zlib1g-dev && \
# See https://crystal-lang.org/install/
- apt-key add /tmp/crystal.gpg && \
- echo "deb https://dist.crystal-lang.org/apt crystal main" > /etc/apt/sources.list.d/crystal.list && \
+ echo "deb http://download.opensuse.org/repositories/devel:/languages:/crystal/Debian_10/ /" | tee /etc/apt/sources.list.d/crystal.list && \
+ gpg --dearmor /tmp/crystal.key && \
+ mv /tmp/crystal.key.gpg /etc/apt/trusted.gpg.d/crystal.gpg && \
+ rm /tmp/crystal.key && \
apt-get update && \
- apt-get install --no-install-recommends --yes crystal=0.33.0-1 && \
+ apt-get install --no-install-recommends --yes crystal && \
# Cleanup
apt-get clean && \
rm -rf /var/lib/apt/lists/*
@ -40,4 +44,4 @@ RUN apt-get --yes remove git gnupg
WORKDIR /data
VOLUME /data
- ENTRYPOINT ["/usr/bin/muse-dl"]
+ ENTRYPOINT ["/usr/bin/muse-dl"]

View File

@ -7,4 +7,7 @@ release:
# Then extract the image | extract the layer.tar file (we only have one layer) | extract the muse-dl-static file
docker image save muse-dl-static | tar xf - --wildcards "*/layer.tar" -O | tar xf - "muse-dl-static"
# And move it to the bin/ directory
- mv -f muse-dl-static bin/
+ mv -f muse-dl-static bin/
+ test:
+ crystal spec

View File

@ -1,4 +1,4 @@
- # muse-dl ![Travis (.org)](https://img.shields.io/travis/captn3m0/muse-dl) ![GitHub issues](https://img.shields.io/github/issues/captn3m0/muse-dl) ![GitHub issues by-label](https://img.shields.io/github/issues/captn3m0/muse-dl/bug?color=red&label=open%20bugs) ![GitHub](https://img.shields.io/github/license/captn3m0/muse-dl) ![GitHub top language](https://img.shields.io/github/languages/top/captn3m0/muse-dl) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)
+ # muse-dl ![Travis (.org)](https://img.shields.io/travis/captn3m0/muse-dl) ![GitHub issues](https://img.shields.io/github/issues/captn3m0/muse-dl) ![GitHub issues by-label](https://img.shields.io/github/issues/captn3m0/muse-dl/bug?color=red&label=open%20bugs) ![GitHub](https://img.shields.io/github/license/captn3m0/muse-dl) ![GitHub top language](https://img.shields.io/github/languages/top/captn3m0/muse-dl) [![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](http://makeapullrequest.com) ![Docker Cloud Automated build](https://img.shields.io/docker/cloud/automated/captn3m0/muse-dl) ![Docker Cloud Build Status](https://img.shields.io/docker/cloud/build/captn3m0/muse-dl) ![Docker Image Size (latest semver)](https://img.shields.io/docker/image-size/captn3m0/muse-dl)
Download PDFs from Project MUSE and stitch them together into a single-file using [`pdftk`](https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/).
@ -28,15 +28,26 @@ A docker image is available at `captn3m0/muse-dl` on Docker Hub. The working dir
```
# Download the book, and put it in your Downloads directory
- docker run -it /home/nemo/Downloads:/data captn3m0/muse-dl https://muse.jhu.edu/book/875
+ docker run -it /home/nemo/Downloads:/data captn3m0/muse-dl:edge https://muse.jhu.edu/book/875
- # If you have a list.txt file in your Downloads directory, then you can run
- docker run -it /home/nemo/Downloads:/data captn3m0/muse-dl /data/list.txt
+ # If you have a list.txt file in your Downloads directory, then you can run
+ docker run -it /home/nemo/Downloads:/data captn3m0/muse-dl:edge /data/list.txt
# If you want to keep the temporary files with your host, and not delete them
- docker run -it /home/nemo/Downloads:/data /tmp:/musetmp --tmp-dir /musetmp --no-cleanup https://muse.jhu.edu/book/875
+ docker run -it /home/nemo/Downloads:/data /tmp:/musetmp captn3m0/muse-dl:edge --tmp-dir /musetmp --no-cleanup https://muse.jhu.edu/book/875
```
+ Replace edge with the latest version number if you'd like to run a tagged release.
+ ### Docker Images
+ The following images are available:
+ - `edge`: Run `muse-dl` against latest master.
+ - `edge-static`: Get the pre-built static-binary against latest master.
+ - `v1.3.1`: Run `muse-dl` against the specific release.
+ - `v1.3.1-static`: Get the pre-built static binary against the specific release.
## Requirements
Please ensure you have `pdftk` installed, unless you're running via docker.
@ -53,8 +64,8 @@ INPUT_FILE: Path to a file containing a list of links
--tmp-dir PATH Temporary Directory to use
--output FILE Output Filename
--no-bookmarks Don't add bookmarks in the PDF
- --input-pdf INPUT Input Stitched PDF. Will not download anything
- --clobber Overwrite the output file, if it already exists. Not compatible with input-pdf
+ --clobber Overwrite the output file, if it already exists.
+ --dont-strip-first-page Disables first page from being stripped. Use carefully
--cookie COOKIE Cookie-header
-h, --help Show this help
```
@ -74,4 +85,4 @@ And it will download all the links in that file.
## License
- Licensed under the [MIT License](https://nemo.mit-license.org/). See LICENSE file for details.
+ Licensed under the [MIT License](https://nemo.mit-license.org/). See LICENSE file for details.

View File

@ -1,14 +1,22 @@
- version: 1.0
+ version: 2.0
shards:
crest:
- github: mamantoha/crest
- version: 0.24.1
+ git: https://github.com/mamantoha/crest.git
+ version: 1.3.12
http-client-digest_auth:
- github: mamantoha/http-client-digest_auth
- version: 0.3.0
+ git: https://github.com/mamantoha/http-client-digest_auth.git
+ version: 0.6.0
+ http_proxy:
+ git: https://github.com/mamantoha/http_proxy.git
+ version: 0.10.1
myhtml:
- github: kostya/myhtml
- version: 1.5.1
+ git: https://github.com/kostya/myhtml.git
+ version: 1.5.8
+ webmock:
+ git: https://github.com/manastech/webmock.cr.git
+ version: 0.14.0+git.commit.42b347cdd64e13193e46167a03593944ae2b3d20

View File

@ -1,5 +1,5 @@
name: muse-dl
- version: 1.1.0
+ version: 1.3.1
authors:
- Nemo <muse.dl@captnemo.in>
@ -15,4 +15,9 @@ dependencies:
myhtml:
github: kostya/myhtml
crest:
- github: mamantoha/crest
+ github: mamantoha/crest
+ development_dependencies:
+ webmock:
+ github: manastech/webmock.cr
+ branch: master

View File

@ -1,7 +1,12 @@
require "./spec_helper"
+ require "webmock"
# require "errors/muse_corrupt_pdf.cr"
describe Muse::Dl::Book do
+ headers = {"Content-Type" => "text/html"}
+ WebMock.stub(:get, "https://muse.jhu.edu/chapter/2379787/pdf")
+ .to_return(body_io: File.new("spec/fixtures/chapter-2379787.html"), headers: headers)
it "should notice the unable to construct chapter PDF error" do
f = "/tmp/chapter-2379787.pdf"
File.delete(f) if File.exists? f

359
spec/fixtures/chapter-2379787.html vendored Normal file

@ -0,0 +1,359 @@
<style>
.page404 {
display: table;
width: 100%;
padding: 60px 4em;
min-height: 350px;
}
.page404 .int {
display: table-cell;
vertical-align: middle;
text-align: left;
}
.page404 h4 {
margin-bottom: 10px;
font-weight: 700;
}
.page404 .logo {
display: table-cell;
width: 23%;
vertical-align: middle;
padding-right: 30px;
}
.page404 blockquote {
border: none;
padding-left: 0;
}
</style>
<!DOCTYPE html>
<html lang="en">
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-58347753-2"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-58347753-2');
</script>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta property="og:image" content="/images/muselogo_dark.jpg" />
<title>Project MUSE</title>
<link rel="search" type="application/opensearchdescription+xml" title="Search Project MUSE from your browser's Searchbar" href="/plugins/muse-opensearch.xml" />
<link rel="stylesheet" type="text/css" href="/css/normalize.css"/>
<link href="/css/jquery.qtip2.css" rel="stylesheet" type="text/css" />
<!-- foundation 6.4.1 custom float/typ/vis 250rem max width 30col float grid -->
<link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro:300,400,400i,600,600i,700,700i" rel="stylesheet">
<link rel="stylesheet" type="text/css" href="/css/foundation.min.css"/>
<link rel="stylesheet" type="text/css" href="/css/style_home2.css?031820"/>
<script type="text/javascript" src="/js/jquery3.js"></script>
<script type="text/javascript" src="/js/pre.js"></script>
<script type="text/javascript" src="/js/core/head.js?new"></script>
<script type="text/javascript" src="https://s7.addthis.com/js/250/addthis_widget.js#pubid=ra-4ecb5479089cb81a"></script>
<title>Article</title>
</head>
<body>
<a id="skip" href="#skip_target">[Skip to Content]</a>
<span id="top"></span>
<div id="header" role="banner" aria-label="header">
<div class="row wrap" id="institution_banner">
<div class="content">
<div id="institution_wrap" class="columns small-15 medium-text-left">
<div id="institution" class="img_text_col">
<div class="img_contain_left"><img src="/images/institution.png" alt="institution icon" /></div>
<div class="text_contain_left"><span class="small"><a href='/account' class='color_white login_status'>Institutional Login</a></span></div>
</div>
</div>
<div id="person_wrap" class="columns small-15">
<div id="person" class="img_text_col">
<div class="img_contain_right"><img src="/images/person.png" alt="account icon" /></div>
<div class="text_contain_right"><span class="small"><a href="/account/" class="color_white login_status" onclick="gtag('event', 'click', {'event_category': 'Account link', 'event_label': 'account name link - header'});">LOG IN</a></span></div>
</div>
</div>
</div>
</div>
<div class="row wrap" id="search_banner">
<div class="content">
<div class="medium-4 small-4 columns" id="header_logo_wrap">
<div id="header_logo">
<a href="/"><img src="/images/muselogo.png" alt="Project MUSE" class="show-for-large"/>
<img src="/images/muselogo_notext.png" alt="Project MUSE" class="hide-for-large"/></a>
</div>
</div>
<div class="medium-21 small-22 columns" id="search_bar_wrap">
<div class="row">
<div id="browse_button_wrap">
<a id="browse_button" href="/browse" onclick="gtag('event', 'click', {'event_category': 'Browse link', 'event_label': 'browse button - header'});"><span class="small">browse</span></a>
</div>
<div id="or_text_wrap" class="show-for-medium">
<div id="or_text">
<span class="small">or</span>
</div>
</div>
<div id="search_input_wrap" class="small-30">
<div id="search_input">
<noscript>
<form method="post" action="/search/">
<input name="no_js_header_query"/>
<input type="hidden" name="action" value="search"/>
<input type="hidden" name="t" value="header"/>
<a id="search_button">
<input type="image" src="/images/search_white.png" alt="Search icon"/>
</a>
</form>
</noscript>
<script>document.write('<input name="search_input_header" id="search_input_header" aria-label="search input"/>');</script>
<script>document.write('<a id="search_button"><img src="/images/search_white.png" alt="Search icon"/></a>');</script>
</div>
</div>
</div>
</div>
<div class="medium-5 small-4 columns" id="menu_wrap">
<div id="menu" class="menu-btn">
<div class="nav-toggle">
<div class="nav-toggle-btn">
<a href="#" class="menu-icon-wrap">
<span class="icon"></span>
<span class="small show-for-large">menu</span>
</a>
</div>
<div class="nav-mobile">
<a href="/search">Advanced Search</a>
<a href="/browse">Browse</a>
<script>
document.write('<div class="accordion">');
</script>
<noscript>
<div class="accordion noscript">
</noscript>
<a href="#" class="acc_trig open"><span>MyMUSE Account</span></a>
<div class="acc_block">
<a href="/account">Log In / Sign Up</a>
<a href="/account/change">Change My Account</a>
<a href="/account/user_settings">User Settings</a>
<a href="/account/">Access via Institution</a>
<a href="/account/saved_items">MyMUSE Library</a>
<a href="/account/search_history">Search History</a>
<a href="/account/view_history">View History</a>
<a href="/account/purchase_history">Purchase History</a>
<a href="/account/alerts">MyMUSE Alerts</a>
</div>
</div>
<div class="nav-mobile-footer">
<!--<a class="modal_trigger">Contact Support</a>-->
<a href="/contact">Contact Support</a>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<div class="page404" id="main">
<div class="logo">
<img src="/images/muselogo_notext.png" alt="MUSE logo">
</div>
<div class="int">
<html><head><title>Error</title></head><body>Unable to construct chapter PDF</body></html>
</div>
</div>
<div id="footer_block" role="banner" aria-label="footer">
<div class="content">
<div class="wrap row" id="about_wrap">
<div id="about">
<h3>Project MUSE Mission</h3>
<p>Project MUSE promotes the creation and dissemination of essential humanities and social science resources through collaboration with libraries, publishers, and scholars worldwide. Forged from a partnership between a university press and a library, Project MUSE is a trusted part of the academic and scholarly community it serves.</p>
</div>
<div id="about_logo" class="columns medium-10 show-for-large">
<img src="/images/muselogo_notext.png" alt="MUSE logo"/>
</div>
</div>
</div>
<div class="footer_main">
<div class="footer_item_color wrap">
<div class="footer_item_left">
<div class="group">
<div class="footer_item_about cont_sub">
<h5 class="small">about</h5>
<ul>
<li><a href="https://about.muse.jhu.edu/publishers">Publishers</a></li>
<li><a href="https://about.muse.jhu.edu/about/discovery-partners/">Discovery Partners</a></li>
<li><a href="https://about.muse.jhu.edu/about/advisory-board/">Advisory Board</a></li>
<li><a href="https://about.muse.jhu.edu/about/journal-subscribers/">Journal Subscribers</a></li>
<li><a href="https://about.muse.jhu.edu/about/book-customers">Book Customers</a></li>
<li><a href="https://about.muse.jhu.edu/about/at-conferences/">Conferences</a></li>
</ul>
</div>
<div class="footer_item_res cont_sub">
<h5 class="small">resources</h5>
<ul>
<li><a href="https://about.muse.jhu.edu/resources/news/">News & Announcements</a></li>
<li><a href="https://about.muse.jhu.edu/resources/promotional-materials">Promotional Material</a></li>
<li><a href="https://about.muse.jhu.edu/resources/alerts">Get Alerts</a></li>
<li><a href="https://about.muse.jhu.edu/resources/muse-presentations">Presentations</a></li>
</ul>
</div>
<div class="clear"></div>
</div>
<div class="group">
<div class="footer_item_what cont_sub">
<h5 class="small">what's on muse</h5>
<ul>
<li><a href="https://about.muse.jhu.edu/muse">Open Access</a></li>
<li><a href="https://about.muse.jhu.edu/pub/journals">Journals</a></li>
<li><a href="https://about.muse.jhu.edu/pub/books">Books</a></li>
</ul>
</div>
<div class="footer_item_info cont_sub">
<h5 class="small">information for</h5>
<ul>
<li><a href="https://about.muse.jhu.edu/publishers">Publishers</a></li>
<li><a href="https://about.muse.jhu.edu/librarians">Librarians</a></li>
<li><a href="https://about.muse.jhu.edu/individuals">Individuals</a></li>
</ul>
</div>
<div class="clear"></div>
</div>
</div>
<div class="footer_item_right">
<div class="group">
<div class="footer_item_social cont_sub">
<h5 class="small">Contact</h5>
<ul>
<li class="clear"><a href="/contact">Contact Us</a></li>
<li><a href="https://about.muse.jhu.edu/resources/help-overview">Help</a></li>
</ul>
<ul>
<li>
<ol class="social_icons">
<li class="list_h"><a href="https://www.facebook.com/ProjectMUSE"><img src="/images/footer_icon_fb.png" alt="Facebook" /></a></li>
<li class="list_h"><a href="https://www.linkedin.com/company/projectmuse/"><img src="/images/footer_icon_linkedin.png" alt="Linkedin" /></a></li>
<li class="list_h"><a href="https://twitter.com/ProjectMUSE"><img src="/images/footer_icon_twitter.png" alt="Twitter" /></a></li>
</ol>
</li>
</ul>
</div>
<div class="footer_item_policy cont_sub">
<h5 class="small">Policy & Terms</h5>
<ul>
<li><a href="https://about.muse.jhu.edu/about/accessibility/">Accessibility</a></li>
<li><a href="/privacy_policy">Privacy Policy</a></li>
<li><a href="/terms_use">Terms of Use</a></li>
</ul>
</div>
<div class="clear"></div>
</div>
<div class="group">
<div class="footer_item_addr cont_sub">
<p class="address"><span>2715 North Charles Street<br/>Baltimore, Maryland, USA 21218</span></p>
<p class="phone"><span><a href="tel:1-410-516-6989">+1 (410) 516-6989</a></span><br>
<span><a href="mailto:muse@press.jhu.edu">muse@press.jhu.edu</a></span></p>
<p class="footer_text_sm copy color_oxfordblue hide-for-small"><span>&copy;2020 Project MUSE. Produced by Johns Hopkins University Press in collaboration with The Sheridan Libraries.</span></p>
</div>
<div class="footer_item_logo cont_sub">
<p class="show-for-medium"><span class="semiboldit footer_text_sm">Now and always,<br/>The Trusted Content Your Research Requires.</span></p>
<p><span><a href="https://muse.jhu.edu">
<img class="show-for-medium" src="/images/muselogoblack.png" alt="Project MUSE logo" />
<img class="hide-for-medium" src="/images/muselogo.png" alt="Project MUSE logo" /></a></span></p>
<p class="hide-for-medium"><span class="semiboldit footer_text_sm">Now and always, The Trusted Content Your Research Requires.</span></p>
<p class="hide-for-small"><span class="footer_text_sm">Built on the Johns Hopkins University Campus</span></p>
</div>
<div class="clear"></div>
</div>
</div>
<div class="clear"></div>
</div>
</div>
<div class="footer_item_sub wrap hide-for-medium">
<p><span class="footer_text_sm">Built on the Johns Hopkins University Campus</span></p>
<p class="footer_text_sm copy color_oxfordblue"><span>&copy;2020 Project MUSE. Produced by Johns Hopkins University Press in collaboration with The Sheridan Libraries.</span></p>
</div>
</div>
<div id="btn_top">
<a href="#top"><span>Back To Top</span></a>
</div>
<input type="hidden" name="cookie_acknowledgement_type" id="cookie_acknowledgement_type" value="cookie_acknowledgement">
<div id="cookies_msg">
<p>This website uses cookies to ensure you get the best experience on our website. Without cookies your experience may not be seamless.</p>
<script>document.writeln('<a href="javascript://" class="btn_accept" id="accept_cookie_msg">Accept</a>');</script>
<noscript>
<form method="post" action="/account/set_attribute_no_ajax/cookie_acknowledgement/1">
<input type="submit" class="btn_accept" value="accept">
</form>
</noscript>
</div>
<script type="text/javascript" src="/js/lightbox.js"></script>
<script type="text/javascript" src="/js/jquery.qtip2.min.js"></script>
<script type="text/javascript" src="/js/post.js"></script>
<script type="text/javascript" src="/js/footnotes.js"></script>
<script type="text/javascript" src="/js/references.js"></script>
</body>
</html>

1263
spec/fixtures/issue-35852.html vendored Normal file

File diff suppressed because it is too large

1603
spec/fixtures/issue-41793.html vendored Normal file

File diff suppressed because it is too large

1522
spec/fixtures/journal-159.html vendored Normal file

File diff suppressed because it is too large

65
spec/fixtures/ratelimit.html vendored Normal file

@ -0,0 +1,65 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Too Many Free PDF Requests</title>
<style>
body {
margin: 0;
padding: 0;
}
.page429 {
display: table;
width: 100%;
padding: 60px 30px;
box-sizing: border-box;
min-height: 350px;
font-family: sans-serif;
}
.page429 .int {
display: table-cell;
vertical-align: middle;
text-align: left;
padding-left: 30px;
}
.page429 h4 {
margin-bottom: 10px;
font-weight: 700;
font-size: 24px;
}
.page429 .logo {
display: table-cell;
width: 23%;
max-width: 182px;
vertical-align: middle;
}
.page429 .logo img {
max-width: 100%;
height: auto;
}
.page429 p {
font-weight: normal;
line-height: 1.3;
}
.page429 a {
text-decoration: none;
color: #284f84;
}
</style>
</head>
<body>
<div class="page429" id="main">
<div class="logo">
<a href="https://muse.jhu.edu"><img src="/images/muselogo_notext.png" alt="MUSE logo"></a>
</div>
<div class="int">
<h4>Too Many Free PDF Requests</h4>
<p>Your IP has requested too many free PDFs too quickly.</p>
<p>Please wait before you continue downloading, and if possible slow down the rate of your requests.</p>
</div>
</div>
</body>
</html>
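
A check for this rate-limit page could look roughly like the sketch below. It is an assumption about how such a check might be written, not the code added in the "Adds a check for rate-limit" commit: `RATE_LIMIT_MARKER` and `rate_limited?` are hypothetical names, and matching on the page heading is only one possible heuristic. The sketch uses only syntax that Crystal shares with Ruby.

```ruby
# Hypothetical marker: the heading text of the fixture page above.
RATE_LIMIT_MARKER = "Too Many Free PDF Requests"

# True when the response body looks like the rate-limit page
# rather than an actual PDF.
def rate_limited?(body)
  !body.index(RATE_LIMIT_MARKER).nil?
end

puts rate_limited?("<h4>Too Many Free PDF Requests</h4>")
puts rate_limited?("%PDF-1.4 binary content")
```

A caller could use such a check to stop, or to wait before retrying, instead of saving the HTML error page under a `.pdf` name.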

85
spec/issue_spec.cr Normal file

@ -0,0 +1,85 @@
require "../src/issue"
require "./spec_helper"
require "webmock"
describe Muse::Dl::Issue do
WebMock.stub(:get, "https://muse.jhu.edu/issue/41793")
.to_return(body: File.new("spec/fixtures/issue-41793.html").gets_to_end)
issue = Muse::Dl::Issue.new "41793"
issue.parse
it "should initialize correctly" do
issue.id.should eq "41793"
issue.url.should eq "https://muse.jhu.edu/issue/41793"
end
it "should parse info correctly" do
issue.info["ISSN"].should eq "1530-7131"
issue.info["Print ISSN"].should eq "1531-2542"
issue.info["Launched on MUSE"].should eq "2020-02-05"
issue.info["Open Access"].should eq "No"
issue.title.should eq "Volume 20, Number 1, January 2020"
end
it "should parse title correctly" do
issue.volume.should eq "20"
issue.number.should eq "1"
issue.date.should eq "January 2020"
end
it "should parse summary" do
issue.summary.should eq <<-EOT
Focusing on important research about the role of academic libraries and librarianship, portal also features commentary on issues in technology and publishing. Written for all those interested in the role of libraries within the academy, portal includes peer-reviewed articles addressing subjects such as library administration, information technology, and information policy. In its inaugural year, portal earned recognition as the runner-up for best new journal, awarded by the Council of Editors of Learned Journals (CELJ). An article in portal, "Master's and Doctoral Thesis Citations: Analysis and Trends of a Longitudinal Study," won the Jesse H. Shera Award for Distinguished Published Research from the Library Research Round Table of the American Library Association.
EOT
end
it "should parse publisher" do
issue.publisher.should eq "Johns Hopkins University Press"
end
it "should parse the journal title" do
issue.journal_title.should eq "portal: Libraries and the Academy"
end
it "should parse non-numbered issues" do
WebMock.stub(:get, "https://muse.jhu.edu/issue/35852")
.to_return(body: File.new("spec/fixtures/issue-35852.html").gets_to_end)
issue = Muse::Dl::Issue.new "35852"
issue.parse
issue.volume.should eq "1"
issue.number.should eq "2"
issue.date.should eq "2016"
issue.info["ISSN"].should eq "2474-9419"
issue.info["Print ISSN"].should eq "2474-9427"
issue.info["Launched on MUSE"].should eq "2017-02-21"
issue.info["Open Access"].should eq "Yes"
issue.title.should eq "Volume 1, Issue 2, 2016"
issue.journal_title.should eq "Constitutional Studies"
expected_pages = [
[1, 22],
[23, 40],
[41, 58],
[59, 80],
[81, 95],
[97, 116],
]
expected_titles = [
"The Limits of Veneration: Public Support for a New Constitutional Convention",
"Secession and Nullification as a Global Trend",
"Challenging Constitutionalism in Post-Apartheid South Africa",
"Democracy by Lawsuit: Or, Can Litigation Alleviate the European Unions “Democratic Deficit?”",
"Private Enforcement of Constitutional Guarantees in the Ku Klux Act of 1871",
"Sober Second Thoughts: Evaluating the History of Horizontal Judicial Review by the U.S. Supreme Court",
]
issue.articles.each_with_index do |a, i|
a.start_page.should eq expected_pages[i][0]
a.end_page.should eq expected_pages[i][1]
a.title.should eq expected_titles[i]
end
end
end
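
The title expectations in this spec ("Volume 20, Number 1, January 2020" and "Volume 1, Issue 2, 2016" both yielding a volume, a number, and a date) can be illustrated with a single regular expression. This is a sketch under assumptions, not the parser in `src/issue.cr`: `ISSUE_TITLE_PATTERN` and `parse_issue_title` are hypothetical names, and the pattern covers only the two title shapes that appear in the spec. It uses only syntax that Crystal shares with Ruby.

```ruby
# Hypothetical pattern covering the two title shapes used in the spec.
ISSUE_TITLE_PATTERN = /^Volume (\d+), (?:Number|Issue) (\d+), (.+)$/

# Returns a hash with "volume", "number" and "date", or nil when the
# title does not match the assumed shape.
def parse_issue_title(title)
  match = ISSUE_TITLE_PATTERN.match(title)
  if match
    {"volume" => match[1], "number" => match[2], "date" => match[3]}
  else
    nil
  end
end

p parse_issue_title("Volume 20, Number 1, January 2020")
p parse_issue_title("Volume 1, Issue 2, 2016")
```

Returning `nil` for unmatched titles keeps the sketch honest about its limits: titles in any other shape would need further patterns.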

28
spec/journal_spec.cr Normal file

@ -0,0 +1,28 @@
require "./spec_helper"
describe Muse::Dl::Journal do
html = File.new("spec/fixtures/journal-159.html").gets_to_end
j = Muse::Dl::Journal.new html
it "should parse the infobox for 159" do
j.info["ISSN"].should eq "1530-7131"
j.info["Print ISSN"].should eq "1531-2542"
j.info["Coverage Statement"].should eq "Vol. 1 (2001) through current issue"
j.info["Open Access"].should eq "No"
end
it "should parse summary" do
j.summary.should eq <<-EOT
Focusing on important research about the role of academic libraries and librarianship, portal also features commentary on issues in technology and publishing. Written for all those interested in the role of libraries within the academy, portal includes peer-reviewed articles addressing subjects such as library administration, information technology, and information policy. In its inaugural year, portal earned recognition as the runner-up for best new journal, awarded by the Council of Editors of Learned Journals (CELJ). An article in portal, "Master's and Doctoral Thesis Citations: Analysis and Trends of a Longitudinal Study," won the Jesse H. Shera Award for Distinguished Published Research from the Library Research Round Table of the American Library Association.
EOT
end
it "should parse publisher" do
j.publisher.should eq "Johns Hopkins University Press"
end
it "should return issues" do
j.issues[0].id.should eq "41793"
j.issues[-1].id.should eq "1578"
end
end

View File

@ -13,7 +13,6 @@ describe Muse::Dl::Parser do
parser = Muse::Dl::Parser.new(["https://muse.jhu.edu/book/68534"])
parser.bookmarks.should eq true
parser.cleanup.should eq true
- parser.tmp.should eq "/tmp"
parser.output.should eq "tempfilename.pdf"
parser.url.should eq "https://muse.jhu.edu/book/68534"
end

9
spec/util_spec.cr Normal file

@ -0,0 +1,9 @@
require "../src/util"
require "./spec_helper"
describe Muse::Dl::Util do
it "should sanitize filenames properly" do
fn = Muse::Dl::Util.slug_filename("Hello world - \" :A$3, a story; a poem|chapter")
fn.should eq "Hello world - - -A-3, a story- a poem-chapter"
end
end
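
The expectation in this spec is consistent with replacing each unsafe character by a single hyphen. The sketch below reproduces that one input/output pair and nothing more: `sanitize_filename` and `UNSAFE_FILENAME_CHARS` are hypothetical names, and the character set is inferred from this single test case, so the real `Muse::Dl::Util.slug_filename` may well handle more characters or work differently. It uses only syntax that Crystal shares with Ruby.

```ruby
# Hypothetical character set: only the characters that the spec input
# shows being replaced (double quote, dollar, colon, semicolon, pipe).
UNSAFE_FILENAME_CHARS = /["$:;|]/

# Replace every unsafe character with a hyphen; everything else is kept.
def sanitize_filename(name)
  name.gsub(UNSAFE_FILENAME_CHARS, "-")
end

puts sanitize_filename("Hello world - \" :A$3, a story; a poem|chapter")
```

For the spec's input this prints `Hello world - - -A-3, a story- a poem-chapter`, matching the expected value in the test.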

19
src/article.cr Normal file

@ -0,0 +1,19 @@
require "./infoparser.cr"
require "./issue.cr"
module Muse::Dl
class Article
getter id : String, :start_page, :end_page, :title
setter title : String | Nil, start_page : Int32 | Nil, end_page : Int32 | Nil
def initialize(id : String)
@id = id
@url = "https://muse.jhu.edu/article/#{id}"
end
# TODO: Fix this
def open_access
return false
end
end
end


@ -0,0 +1,4 @@
module Muse::Dl::Errors
class DownloadError < Exception
end
end


@ -1,4 +0,0 @@
module Muse::Dl::Errors
class MissingChapter < Exception
end
end


@ -0,0 +1,4 @@
module Muse::Dl::Errors
class MissingFile < Exception
end
end


@ -0,0 +1,4 @@
module Muse::Dl::Errors
class PDFOperationError < Exception
end
end


@ -4,7 +4,8 @@ require "myhtml"
module Muse::Dl
class Fetch
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
DOWNLOAD_TIMEOUT_SECS = 60
HEADERS = {
"User-Agent" => USER_AGENT,
@ -13,6 +14,10 @@ module Muse::Dl
"Connection" => "keep-alive",
}
def self.article_file_name(id : String, tmp_path : String)
"#{tmp_path}/article-#{id}.pdf"
end
def self.chapter_file_name(id : String, tmp_path : String)
"#{tmp_path}/chapter-#{id}.pdf"
end
@ -22,9 +27,83 @@ module Muse::Dl
File.delete(fns) if File.exists?(fns)
end
def self.save_chapter(tmp_path : String, chapter_id : String, chapter_title : String, cookie : String | Nil = nil, add_bookmark = true)
def self.cleanup_articles(tmp_path : String, id : String)
fns = article_file_name(id, tmp_path)
File.delete(fns) if File.exists?(fns)
end
def self.save_url(url : String, referer : String, file_name : String, tmp_path : String, cookie : String | Nil = nil, bookmark_title : String | Nil = nil, strip_first_page = true)
tmp_pdf_file = "#{file_name}.tmp"
if File.exists? file_name
puts "#{file_name} already downloaded"
return
end
uri = URI.parse(url)
http_client = HTTP::Client.new(uri)
# Raise an IO::TimeoutError after DOWNLOAD_TIMEOUT_SECS (60) seconds.
http_client.read_timeout = DOWNLOAD_TIMEOUT_SECS
headers = HEADERS.merge({
"Referer" => referer,
})
if cookie
headers["Cookie"] = cookie
end
request = Crest::Request.new(:get, url, headers: headers, max_redirects: 0, handle_errors: false)
begin
response = request.execute
rescue ex : IO::TimeoutError
raise Muse::Dl::Errors::DownloadError.new("Error downloading #{url}. Download took longer than #{DOWNLOAD_TIMEOUT_SECS} seconds.")
end
# TODO: Add validation for the downloaded file (should be PDF)
if !response.success?
raise Muse::Dl::Errors::DownloadError.new("Error downloading #{url}. HTTP response code: #{response.status}")
end
content_type = response.headers["Content-Type"]
if content_type.is_a? String
if /html/.match content_type
response.body.each_line do |line|
# https://muse.jhu.edu/chapter/2383438/pdf
# https://muse.jhu.edu/book/67393
# Errors are Unable to determine page runs / Unable to construct chapter PDF
if /Unable to/.match line
raise Muse::Dl::Errors::MuseCorruptPDF.new("Error: MUSE is unable to generate PDF for #{url}")
end
if /Your IP has requested/.match line
raise Muse::Dl::Errors::DownloadError.new("Error: MUSE Rate-limit reached")
end
end
end
end
File.open(tmp_pdf_file, "w") do |file|
file << response.body
if file.size == 0
# Report the Content-Length of the response, not of the outgoing request headers
raise Muse::Dl::Errors::DownloadError.new("Error: downloaded chapter file size is zero. Response Content-Length header was #{response.headers["Content-Length"]?}")
end
end
pdftk = Muse::Dl::Pdftk.new tmp_path
pdftk.strip_first_page tmp_pdf_file if strip_first_page
if bookmark_title
# Run pdftk and add the bookmark to the file
pdftk.add_bookmark tmp_pdf_file, bookmark_title
end
# Now we can move the file to the proper PDF filename
File.rename tmp_pdf_file, file_name
end
def self.save_chapter(tmp_path : String, chapter_id : String, chapter_title : String, cookie : String | Nil = nil, add_bookmark = true, strip_first_page = true)
final_pdf_file = chapter_file_name chapter_id, tmp_path
tmp_pdf_file = "#{final_pdf_file}.tmp"
if File.exists? final_pdf_file
puts "#{chapter_id} already downloaded"
@ -33,49 +112,22 @@ module Muse::Dl
# TODO: Remove this hardcoding, and make this more generic by generating it within the Book class
url = "https://muse.jhu.edu/chapter/#{chapter_id}/pdf"
headers = HEADERS.merge({
"Referer" => "https://muse.jhu.edu/verify?url=%2Fchapter%2F#{chapter_id}%2Fpdf",
})
referer = "https://muse.jhu.edu/verify?url=%2Fchapter%2F#{chapter_id}%2Fpdf"
if cookie
headers["Cookie"] = cookie
end
save_url(url, referer, final_pdf_file, tmp_path, cookie, chapter_title, strip_first_page)
# TODO: Add validation for the downloaded file (should be PDF)
Crest.get(url, max_redirects: 0, handle_errors: false, headers: headers) do |response|
# puts response.headers["Content-Type"]
content_type = response.headers["Content-Type"]
if content_type.is_a? String
if /html/.match content_type
puts response
response.body_io.each_line do |line|
if /Unable to construct chapter PDF/.match line
raise Muse::Dl::Errors::MuseCorruptPDF.new
end
end
end
end
File.open(tmp_pdf_file, "w") do |file|
IO.copy(response.body_io, file)
end
end
pdftk = Muse::Dl::Pdftk.new tmp_path
pdftk.strip_first_page tmp_pdf_file
if add_bookmark
# Run pdftk and add the bookmark to the file
pdftk.add_bookmark tmp_pdf_file, chapter_title
end
# Now we can move the file to the proper PDF filename
File.rename tmp_pdf_file, final_pdf_file
puts "Downloaded #{chapter_id}"
end
def self.get_info(url : String) : Muse::Dl::Thing | Nil
match = /https:\/\/muse.jhu.edu\/(book|journal)\/(\d+)/.match url
def self.save_article(tmp_path : String, article_id : String, cookie : String | Nil = nil, article_title = nil, strip_first_page = true)
file_name = article_file_name article_id, tmp_path
url = "https://muse.jhu.edu/article/#{article_id}/pdf"
referer = "https://muse.jhu.edu/article/#{article_id}"
save_url(url, referer, file_name, tmp_path, cookie, article_title, strip_first_page)
end
def self.get_info(url : String)
match = /https:\/\/muse.jhu.edu\/(book|journal|issue|article)\/(\d+)/.match url
if match
begin
response = Crest.get(url).to_s
@ -84,12 +136,16 @@ module Muse::Dl
return Muse::Dl::Book.new response
when "journal"
return Muse::Dl::Journal.new response
when "issue"
return Muse::Dl::Issue.new match[2], response
when "article"
return Muse::Dl::Article.new match[2]
end
rescue ex : Crest::NotFound
raise Muse::Dl::Errors::InvalidLink.new
raise Muse::Dl::Errors::InvalidLink.new("Error - could not download url: #{url}")
end
else
raise Muse::Dl::Errors::InvalidLink.new
raise Muse::Dl::Errors::InvalidLink.new("Error - url does not match expected pattern: #{url}")
end
end
end
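
The dispatch in `get_info` depends on the first capture group naming the kind of resource and the second carrying its numeric ID. A small sketch of that matching step in isolation, assuming a hypothetical helper `classify` and a constant `MUSE_URL` (neither exists in the diff); unlike the pattern in the diff, the dots in the host name are escaped here:

```crystal
# Hypothetical helper, not part of the diff: returns {kind, id} for a
# Project MUSE URL, or nil when the URL does not match.
MUSE_URL = /https:\/\/muse\.jhu\.edu\/(book|journal|issue|article)\/(\d+)/

def classify(url : String) : Tuple(String, String)?
  if match = MUSE_URL.match(url)
    # match[1] is the resource kind, match[2] the numeric ID
    {match[1], match[2]}
  end
end

p classify("https://muse.jhu.edu/issue/41793")
p classify("https://example.com/book/1")
```

The first call should yield the pair `issue` / `41793`, and the second `nil`, which corresponds to the `InvalidLink` branch in `get_info`.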


@ -34,6 +34,18 @@ module Muse::Dl
myhtml.css("#book_about_info .title").map(&.inner_text).to_a[0].strip
end
def self.issue_title(myhtml : Myhtml::Parser)
begin
myhtml.css(".card_text .title").map(&.inner_text).to_a[0].strip
rescue
nil
end
end
def self.journal_title(myhtml : Myhtml::Parser)
myhtml.css("#journal_about_info .title").map(&.inner_text).to_a[0].strip
end
def self.author(myhtml : Myhtml::Parser)
myhtml.css("#book_about_info .author").map(&.inner_text).to_a[0].strip.gsub("<BR>", ", ").gsub("\n", " ")
end
@ -50,9 +62,13 @@ module Muse::Dl
myhtml.css("#book_about_info .pub a").map(&.inner_text).to_a[0].strip
end
def self.journal_publisher(myhtml : Myhtml::Parser)
myhtml.css(".card_publisher a").map(&.inner_text).to_a[0].strip
end
def self.summary(myhtml : Myhtml::Parser)
begin
return myhtml.css("#book_about_info .card_summary").map(&.inner_text).to_a[0].strip
return myhtml.css(".card_summary").map(&.inner_text).to_a[0].strip
rescue e : Exception
STDERR.puts "Could not fetch summary"
return "NA"

97
src/issue.cr Normal file

@ -0,0 +1,97 @@
require "./thing.cr"
require "./fetch.cr"
require "./article.cr"
module Muse::Dl
class Issue
getter id : String,
title : String | Nil,
articles : Array(Muse::Dl::Article),
url : String,
summary : String | Nil,
publisher : String | Nil,
info : Hash(String, String),
volume : String | Nil,
number : String | Nil,
date : String | Nil,
journal_title : String | Nil
setter :journal_title
def initialize(id : String, response : String | Nil = nil)
@id = id
@url = "https://muse.jhu.edu/issue/#{id}"
@articles = [] of Muse::Dl::Article
# Initialize @info before parsing so that the parsed infobox is not overwritten
@info = Hash(String, String).new
parse(response) if response
end
def open_access
if @info.has_key? "Open Access"
return @info["Open Access"] == "Yes"
end
false
end
def parse
html = Crest.get(@url).to_s
parse(html)
end
def parse(html : String)
h = Myhtml::Parser.new html
@info = InfoParser.infobox(h)
@title = InfoParser.issue_title(h)
@summary = InfoParser.summary(h)
@publisher = InfoParser.journal_publisher(h)
parse_title
parse_contents(h)
end
def parse_title
t = @title
unless t.nil?
@volume = /Volume (\d+)/.match(t).try &.[1]
@number = /Number (\d+)/.match(t).try &.[1]
@number = /Issue (\d+)/.match(t).try &.[1] unless @number
@date = /((January|February|March|April|May|June|July|August|September|October|November|December|Spring|Winter|Fall|Summer) (\d+))/.match(t).try &.[1]
@date = /(\d{4})/.match(t).try &.[1] unless @date
end
end
def parse_contents(myhtml : Myhtml::Parser)
unless @journal_title
journal_title_a = myhtml.css("#journal_banner_title a").first
if journal_title_a
@journal_title = journal_title_a.inner_text
end
end
myhtml.css(".articles_list_text ol").each do |ol|
link = ol.css("li.title a").first
title = link.inner_text
pages = ol.css("li.pg")
if pages.size > 0
p = pages.first.try &.inner_text
matches = /(\d+)-(\d+)/.match p
if matches
start_page = matches[1].to_i
end_page = matches[2].to_i
end
end
ol.css("a").each do |l|
url = l.attribute_by("href").to_s
matches = /\/article\/(\d+)\/pdf/.match url
if matches
a = Muse::Dl::Article.new matches[1]
a.title = title
a.start_page = start_page if start_page
a.end_page = end_page if end_page
@articles.push a
end
end
end
end
end
end
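
`Issue#parse_title` pulls volume, number, and date out of a free-form title using a chain of fallbacks. The same logic as a standalone function, for illustration only: `parse_issue_title` and `MONTH_OR_SEASON` are hypothetical names, and the sample titles are made up rather than taken from MUSE.

```crystal
# Hypothetical standalone version of the title parsing; not part of the diff.
MONTH_OR_SEASON = "January|February|March|April|May|June|July|August|" \
                  "September|October|November|December|Spring|Summer|Fall|Winter"

def parse_issue_title(t : String)
  volume = /Volume (\d+)/.match(t).try &.[1]
  # Prefer "Number N", fall back to "Issue N"
  number = /Number (\d+)/.match(t).try &.[1]
  number ||= /Issue (\d+)/.match(t).try &.[1]
  # Prefer "<month or season> <year>", fall back to a bare four-digit year
  date = /((#{MONTH_OR_SEASON}) (\d+))/.match(t).try &.[1]
  date ||= /(\d{4})/.match(t).try &.[1]
  {volume: volume, number: number, date: date}
end

p parse_issue_title("Volume 20, Number 3, July 2020")
p parse_issue_title("Volume 5, Issue 2, 2019")
```

Each field is `nil` when nothing matches, which is why the corresponding getters on `Issue` are typed `String | Nil`.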


@ -1,6 +1,44 @@
require "./thing.cr"
require "./infoparser.cr"
require "./issue.cr"
module Muse::Dl
class Journal < Muse::Dl::Thing
class Journal
getter :info, :summary, :publisher, :issues, :title
@info = Hash(String, String).new
@summary : String
@publisher : String
@issues = [] of Muse::Dl::Issue
@title : String
private getter :h
def initialize(html)
@h = Myhtml::Parser.new html
@info = InfoParser.infobox(h)
@summary = InfoParser.summary(h)
@publisher = InfoParser.journal_publisher(h)
@title = InfoParser.journal_title(h)
parse_volumes(h)
end
def open_access
if @info.has_key? "Open Access"
return @info["Open Access"] == "Yes"
end
false
end
def parse_volumes(myhtml : Myhtml::Parser)
myhtml.css("#available_issues_list_text a").each do |a|
link = a.attribute_by("href").to_s
matches = /\/issue\/(\d+)/.match link
if matches
issue = Muse::Dl::Issue.new matches[1]
issue.journal_title = @title
@issues.push issue
end
end
end
end
end


@ -4,16 +4,23 @@ require "./fetch.cr"
require "./book.cr"
require "./journal.cr"
require "./util.cr"
require "file_utils"
module Muse::Dl
VERSION = "1.1.0"
VERSION = "1.3.1"
class Main
def self.dl(parser : Parser)
url = parser.url
puts "Downloading #{url}"
thing = Fetch.get_info(url) if url
return unless thing
if (thing.open_access) && (parser.skip_oa)
STDERR.puts "Skipping #{url}, available under Open Access"
return
end
if thing.is_a? Muse::Dl::Book
unless thing.formats.includes? :pdf
STDERR.puts "Book not available in PDF format, skipping: #{url}"
@ -24,34 +31,30 @@ module Muse::Dl
# If file exists and we can't clobber
if File.exists?(parser.output) && parser.clobber == false
STDERR.puts "File already exists: #{parser.output}"
STDERR.puts "Skipping #{url}, File already exists: #{parser.output}"
return
end
temp_stitched_file = nil
pdf_builder = Pdftk.new(parser.tmp)
unless parser.input_pdf
# Save each chapter
thing.chapters.each do |chapter|
begin
Fetch.save_chapter(parser.tmp, chapter[0], chapter[1], parser.cookie, parser.bookmarks)
rescue e : Muse::Dl::Errors::MuseCorruptPDF
STDERR.puts "Got a 'Unable to construct chapter PDF' error from MUSE, skipping: #{url}"
return
end
# Save each chapter
thing.chapters.each do |chapter|
begin
Fetch.save_chapter(parser.tmp, chapter[0], chapter[1], parser.cookie, parser.bookmarks, parser.strip_first)
rescue e : Muse::Dl::Errors::MuseCorruptPDF
STDERR.puts "Got a 'Unable to construct chapter PDF' error from MUSE, skipping: #{url}"
return
end
chapter_ids = thing.chapters.map { |c| c[0] }
# Stitch the PDFs together
temp_stitched_file = pdf_builder.stitch chapter_ids
pdf_builder.add_metadata(temp_stitched_file, parser.output, thing)
else
x = parser.input_pdf
pdf_builder.add_metadata(File.open(x), parser.output, thing) if x
end
chapter_ids = thing.chapters.map { |c| c[0] }
# Stitch the PDFs together
temp_stitched_file = pdf_builder.stitch chapter_ids
pdf_builder.add_metadata(temp_stitched_file, parser.output, thing)
temp_stitched_file.delete if temp_stitched_file
puts "Saved final output to #{parser.output}"
puts "--dont-strip-first-page was on. Please validate PDF file for any errors." unless parser.strip_first
puts "DL: #{url}. Saved final output to #{parser.output}"
# Cleanup the chapter files
if parser.cleanup
@ -59,20 +62,97 @@ module Muse::Dl
Fetch.cleanup(parser.tmp, c[0])
end
end
elsif thing.is_a? Muse::Dl::Article
# No bookmarks are needed since this is just a single article PDF
begin
Fetch.save_article(parser.tmp, thing.id, parser.cookie, nil, parser.strip_first)
rescue e : Muse::Dl::Errors::MuseCorruptPDF
STDERR.puts "Got a 'Unable to construct chapter PDF' error from MUSE, skipping: #{url}"
return
end
# TODO: Move this code elsewhere
source = Fetch.article_file_name(thing.id, parser.tmp)
destination = "article-#{thing.id}.pdf"
# Needed because of https://github.com/crystal-lang/crystal/issues/7777
FileUtils.cp source, destination
FileUtils.rm source if parser.cleanup
elsif thing.is_a? Muse::Dl::Issue
# Will have no effect if parser has a custom title
parser.force_set_output Util.slug_filename "#{thing.journal_title} - #{thing.title}.pdf"
# If file exists and we can't clobber
if File.exists?(parser.output) && parser.clobber == false
STDERR.puts "Skipping #{url}, File already exists: #{parser.output}"
return
end
temp_stitched_file = nil
pdf_builder = Pdftk.new(parser.tmp)
thing.articles.each do |article|
begin
Fetch.save_article(parser.tmp, article.id, parser.cookie, article.title, parser.strip_first)
rescue e : Muse::Dl::Errors::MuseCorruptPDF
STDERR.puts "Got a 'Unable to construct chapter PDF' error from MUSE, skipping: #{url}"
return
end
end
article_ids = thing.articles.map { |a| a.id }
# Stitch the PDFs together
temp_stitched_file = pdf_builder.stitch_articles article_ids
pdf_builder.add_metadata(temp_stitched_file, parser.output, thing)
# temp_stitched_file.delete if temp_stitched_file
puts "--dont-strip-first-page was on. Please validate PDF file for any errors." unless parser.strip_first
puts "DL: #{url}. Saved final output to #{parser.output}"
# Cleanup the issue files
if parser.cleanup
thing.articles.each do |a|
Fetch.cleanup_articles(parser.tmp, a.id)
end
end
elsif thing.is_a? Muse::Dl::Journal
thing.issues.each do |issue|
begin
# Update the issue
issue.parse
parser.url = issue.url
Main.dl parser
rescue e
puts e.message
puts "Faced an exception with previous issue, continuing"
end
end
end
end
def self.run(args : Array(String))
parser = Parser.new(args)
delay_secs = 1
input_list = parser.input_list
if input_list
File.each_line input_list do |url|
# TODO: Change this to nil
parser.reset_output_file
parser.url = url.strip
# Ask the download process to not quit the process, and return instead
Main.dl parser
begin
# TODO: Change this to nil
parser.reset_output_file
parser.url = url.strip
# Ask the download process to not quit the process, and return instead
Main.dl parser
if delay_secs >= 2
delay_secs /= 2
end
rescue ex
puts ex.message
puts ex.backtrace.join("\n ")
puts "Error. Skipping book: #{url}. Waiting for #{delay_secs} seconds before continuing."
sleep(delay_secs)
if delay_secs < 256
delay_secs *= 2
end
end
end
elsif parser.url
Main.dl parser
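
The retry handling in `Main.run` doubles the wait after a failed download (capped at 256 seconds) and halves it after a success. A rough sketch of just that state machine, assuming a hypothetical helper `next_delay`; it uses integer division (`//`), whereas the diff uses `/=`, which produces a `Float64` in Crystal:

```crystal
# Hypothetical helper mirroring the delay handling in Main.run; not part of
# the diff. Returns the delay to use for the next failure.
def next_delay(current : Int32, failed : Bool) : Int32
  if failed
    # Double after a failure, but stop growing once 256 seconds is reached
    current < 256 ? current * 2 : current
  else
    # Halve after a success, never dropping below one second
    current >= 2 ? current // 2 : current
  end
end

delay = 1
[true, true, true, false, false].each do |failed|
  delay = next_delay(delay, failed)
  puts delay
end
# prints 2, 4, 8, 4, 2 (one value per line)
```

Keeping the delay as an integer avoids the local variable silently becoming a union of `Int32` and `Float64`.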


@ -6,16 +6,19 @@ module Muse::Dl
@bookmarks = true
@tmp : String
@cleanup = true
# Whether to strip the first page
@strip_first = true
@output = DEFAULT_FILE_NAME
@url : String | Nil
@input_pdf : String | Nil
@clobber = false
@input_list : String | Nil
@cookie : String | Nil
@h : Bool | Nil
@skip_oa = false
DEFAULT_FILE_NAME = "tempfilename.pdf"
getter :bookmarks, :tmp, :cleanup, :output, :url, :input_pdf, :clobber, :input_list, :cookie
getter :bookmarks, :tmp, :cleanup, :output, :url, :clobber, :input_list, :cookie, :strip_first, :skip_oa
setter :url
# Update the output filename unless we have a custom one passed
@ -23,6 +26,10 @@ module Muse::Dl
@output = output_file unless @output != DEFAULT_FILE_NAME
end
def force_set_output(output_file : String)
@output = output_file
end
def reset_output_file
@output = DEFAULT_FILE_NAME
end
@ -38,7 +45,6 @@ module Muse::Dl
def initialize(arg : Array(String) = [] of String)
@tmp = Dir.tempdir
@input_pdf = nil
parser = OptionParser.new
parser.banner = <<-EOT
@ -48,23 +54,25 @@ module Muse::Dl
INPUT_FILE: Path to a file containing a list of links
EOT
parser.on(long_flag = "--no-cleanup", description = "Don't cleanup temporary files") { @cleanup = false }
parser.on(long_flag = "--tmp-dir PATH", description = "Temporary Directory to use") { |path| @tmp = path }
parser.on(long_flag = "--output FILE", description = "Output Filename") { |file| @output = file }
parser.on(long_flag = "--no-bookmarks", description = "Don't add bookmarks in the PDF") { @bookmarks = false }
parser.on(long_flag = "--input-pdf INPUT", description = "Input Stitched PDF. Will not download anything") { |input| @input_pdf = input }
parser.on(long_flag = "--clobber", description = "Overwrite the output file, if it already exists. Not compatible with input-pdf") { @clobber = true }
parser.on(long_flag = "--clobber", description = "Overwrite the output file, if it already exists.") { @clobber = true }
parser.on(long_flag = "--dont-strip-first-page", description = "Disables first page from being stripped. Use carefully") { @strip_first = false }
parser.on(long_flag = "--cookie COOKIE", description = "Cookie-header") { |cookie| @cookie = cookie }
parser.on("-h", "--help", "Show this help") { puts parser }
parser.on(long_flag = "--skip-open-access", description = "Don't download open access content") { @skip_oa = true }
parser.on("-h", "--help", "Show this help") { @h = true; puts parser }
parser.unknown_args do |args|
if args.size != 1
puts parser
# Prevent showing helptext twice
puts parser unless @h
exit 1
end
if File.exists? args[0]
@input_list = args[0]
@input_pdf = nil
else
@url = args[0]
end


@ -28,14 +28,23 @@ module Muse::Dl
def execute(args : Array(String))
binary = @binary
if binary
Process.run(binary, args)
status = Process.run(binary, args, output: STDOUT, error: STDERR)
if !status.success?
puts "pdftk command failed: #{binary} #{args.join(" ")}"
end
return status.success?
end
end
def strip_first_page(input_file : String)
output_pdf = File.tempfile("muse-dl-temp", ".pdf")
execute [input_file, "cat", "2-end", "output", output_pdf.path]
File.rename output_pdf.path, input_file
is_success = execute [input_file, "cat", "2-end", "output", output_pdf.path]
if is_success
File.rename output_pdf.path, input_file
else
puts ("Error stripping first page of chapter. Maybe try using --dont-strip-first-page")
exit 1
end
end
def add_bookmark(input_file : String, title : String)
@ -48,16 +57,19 @@ module Muse::Dl
BookmarkPageNumber: 1
END
File.write(bookmark_text_file.path, bookmark_text)
execute [input_file, "update_info", bookmark_text_file.path, "output", output_pdf.path]
is_success = execute [input_file, "update_info", bookmark_text_file.path, "output", output_pdf.path]
# Cleanup
bookmark_text_file.delete
File.rename output_pdf.path, input_file
if is_success
File.rename output_pdf.path, input_file
else
raise Muse::Dl::Errors::PDFOperationError.new("Error adding bookmark metadata to chapter.")
end
end
def add_metadata(input_file : File, output_file : String, book : Book)
# First we have to dump the current metadata
metadata_text_file = File.tempfile("muse-dl-metadata-tmp", ".txt")
keywords = "Publisher:#{book.publisher}, Published:#{book.date}"
# Known Info keys, if they are present
@ -67,43 +79,94 @@ module Muse::Dl
end
end
text = <<-EOT
metadata_text = gen_metadata(book.title, keywords, book.summary.gsub(/\n\s+/, " "), book.author)
write_metadata(input_file, output_file, metadata_text)
end
def gen_metadata(title : String, keywords : String, subject : String, author : String | Nil = nil)
metadata = <<-EOT
InfoBegin
InfoKey: Creator
InfoValue: Project MUSE (https://muse.jhu.edu/)
InfoValue:
InfoBegin
InfoKey: Producer
InfoValue: Muse-DL/#{Muse::Dl::VERSION}
InfoValue:
InfoBegin
InfoKey: Title
InfoValue: #{book.title}
InfoValue: #{title}
InfoBegin
InfoKey: Keywords
InfoValue: #{keywords}
InfoBegin
InfoKey: Author
InfoValue: #{book.author}
InfoBegin
InfoKey: Subject
InfoValue: #{book.summary.gsub(/\n\s+/, " ")}
InfoValue: #{subject}
InfoBegin
InfoKey: ModDate
InfoValue:
InfoBegin
InfoKey: CreationDate
InfoValue:
EOT
unless author.nil?
metadata += <<-EOT
InfoBegin
InfoKey: Author
InfoValue: #{author}
EOT
end
return metadata
end
def write_metadata(input_file : File, output_file : String, text)
metadata_text_file = File.tempfile("muse-dl-metadata-tmp", ".txt")
File.write(metadata_text_file.path, text)
execute [input_file.path, "update_info_utf8", metadata_text_file.path, "output", output_file]
is_success = execute [input_file.path, "update_info_utf8", metadata_text_file.path, "output", output_file]
if !is_success
raise Muse::Dl::Errors::PDFOperationError.new("Error adding metadata to book.")
end
metadata_text_file.delete
end
def add_metadata(input_file : File, output_file : String, issue : Issue)
# First we have to dump the current metadata
metadata_text_file = File.tempfile("muse-dl-metadata-tmp", ".txt")
keywords = "Journal:#{issue.journal_title}, Published:#{issue.date},Volume:#{issue.volume},Number:#{issue.number}"
["ISSN", "Print ISSN", "DOI", "Language", "Open Access"].each do |label|
if issue.info.has_key? label
keywords += ", #{label}:#{issue.info[label]}"
end
end
# TODO: Move this to Issue class
s = issue.summary
unless s.nil?
summary = s.gsub(/\n\s+/, " ")
else
summary = "NA"
end
t = issue.title
unless t.nil?
title = t
else
title = "NA"
end
# TODO: Add support for all authors in the PDF
metadata = gen_metadata(title, keywords, summary)
write_metadata(input_file, output_file, metadata)
end
def stitch(chapter_ids : Array(String))
output_file = File.tempfile("muse-dl-stitched-tmp", ".pdf")
# Do some sanity checks on each Chapter PDF
chapter_ids.each do |id|
raise Muse::Dl::Errors::MissingChapter.new unless File.exists? Fetch.chapter_file_name(id, @tmp_file_path)
raise Muse::Dl::Errors::MissingFile.new unless File.exists? Fetch.chapter_file_name(id, @tmp_file_path)
raise Muse::Dl::Errors::CorruptFile.new unless File.size(Fetch.chapter_file_name(id, @tmp_file_path)) > 0
end
@ -111,9 +174,35 @@ module Muse::Dl
chapter_files = chapter_ids.map { |id| Fetch.chapter_file_name(id, @tmp_file_path) }
args = chapter_files + ["cat", "output", output_file.path]
execute args
is_success = execute args
# TODO: Validate final file here
if !is_success
raise Muse::Dl::Errors::PDFOperationError.new("Error stitching chapters together.")
end
return output_file
end
# TODO: Merge with stitch
def stitch_articles(article_ids : Array(String))
output_file = File.tempfile("muse-dl-stitched-tmp", ".pdf")
# Do some sanity checks on each article PDF
article_ids.each do |id|
raise Muse::Dl::Errors::MissingFile.new unless File.exists? Fetch.article_file_name(id, @tmp_file_path)
raise Muse::Dl::Errors::CorruptFile.new unless File.size(Fetch.article_file_name(id, @tmp_file_path)) > 0
end
# Now let's stitch them together
article_files = article_ids.map { |id| Fetch.article_file_name(id, @tmp_file_path) }
args = article_files + ["cat", "output", output_file.path]
is_success = execute args
# TODO: Validate final file here
if !is_success
puts args
raise Muse::Dl::Errors::PDFOperationError.new("Error stitching articles together.")
end
return output_file
end


@ -19,6 +19,13 @@ module Muse::Dl
private getter :h
def open_access
if @info.has_key? "Open Access"
return @info["Open Access"] == "Yes"
end
false
end
def initialize(html : String)
@h = Myhtml::Parser.new html
@info = InfoParser.infobox(h)


@ -2,7 +2,7 @@ module Muse::Dl
class Util
# Generates a safe filename
def self.slug_filename(input : String)
input.strip.tr("\u{202E}%$|:;/\t\r\n\\", "-")
input.strip.tr("\u{202E}%$|:;/\"\t\r\n\\", "-")
end
end
end