Opened 5 months ago

Last modified 4 months ago

#13773 assigned Question/How do I...?

How to DL a list of file links?

Reported by: https://id.mayfirst.org/taboka
Owned by: https://id.mayfirst.org/jamie
Priority: Medium
Component: Tech
Keywords:
Cc:
Sensitive: no

Description (last modified by https://id.mayfirst.org/taboka)

Just curious if anyone knows... I've been looking at various extensions for Chrome and none seem to work.

I have a page with a list of links to PDF files. I often track court cases and each case can have hundreds of files. They're posted on Ecourts. The particular URL with the links is at: https://iapps.courts.state.ny.us/webcivil/FCASeFiledDocsDetail?county_code=U1HFl8xdT1zH475OMtqwXg%3D%3D&txtIndexNo=tZetmFO9f4q%2Fu6CVXIP6ww%3D%3D&showMenu=no&isPreRji=N&civilCase=H8HGl5KB6OMZADQ_PLUS_BbHhOg%3D%3D

The HTML on the page for each link looks like this:

<tr valign=top>
<td><span class=smallfont>4&nbsp;<br><font color="red"></font></span></td>
<td nowrap><span class=smallfont>06/05/2018&nbsp;</span></td>
<td><span class=smallfont>
<a onclick="openPDF('https://iapps.courts.state.ny.us/fbem/DocumentDisplayServlet?documentId=3IUxJsFOSsnivRFS4uvf3A==&system=prod', '')">AFFIDAVIT OR AFFIRMATION IN SUPPORT</a>
</span></td>
<td><span class=smallfont>Affirmation of Aron M. Zimmerman in Support of Plaintiff's Motion for a Temporary Restraining Order and Preliminary Injunction&nbsp;</span></td>
<td><span class=smallfont>001&nbsp;</span></td>
<td><span class=smallfont>ARON M ZIMMERMAN&nbsp;<br><font color="green"></font></span></td>
</tr>

In this case there are 125 PDF files, and I just don't want to save them one at a time.

I can't find a Chrome extension that works. I even looked at wget and curl in the terminal to see if there might be a way.

Any ideas?

Change History (7)

comment:1 Changed 5 months ago by https://id.mayfirst.org/taboka

  • Description modified (diff)

comment:2 Changed 5 months ago by https://id.mayfirst.org/jamie

Hi taboka - this is a fun challenge.

The link you sent has expired (and also is behind a captcha) so unfortunately we can't completely automate things.

But... you can:

  1. Go to the page you want and then click Save page as in your browser (Web page only, html)
  2. Name the file: web.html
  3. Open a terminal and navigate to the folder where you saved the page
  4. Run:
    egrep -o 'https://iapps.courts.state.ny.us/fbem/DocumentDisplayServlet\?documentId=[a-zA-Z0-9=]+&system=prod' web.html
    

That should print a clean list of URLs for all of the PDF files.

If you want to download them all automatically it gets a bit tricky, since their site is designed to make this hard (they seem to reject requests from wget and curl). But that is easy enough to get around:

mkdir -p pdfs && for link in $(egrep -o 'https://iapps.courts.state.ny.us/fbem/DocumentDisplayServlet\?documentId=[a-zA-Z0-9=]+&system=prod' web.html); do id=$(echo "$link" | egrep -o 'documentId=[a-zA-Z0-9]+' | sed "s/documentId=//") && wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" "$link" -O pdfs/${id}.pdf; done

This snippet:

  1. Creates a pdfs directory if it doesn't already exist.
  2. Iterates over all the links it finds in the web.html file.
  3. For each link it finds, captures the documentId part.
  4. Uses wget (which masquerades as Firefox) to download the PDF into the pdfs directory, named after that ID (an expanded version follows below).
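
For reference, here is the same loop written out across multiple lines (the commands are unchanged; it assumes the same web.html and pdfs names as above):

mkdir -p pdfs
for link in $(egrep -o 'https://iapps.courts.state.ny.us/fbem/DocumentDisplayServlet\?documentId=[a-zA-Z0-9=]+&system=prod' web.html); do
    # pull out just the documentId value to use as the filename
    id=$(echo "$link" | egrep -o 'documentId=[a-zA-Z0-9]+' | sed "s/documentId=//")
    # fetch the PDF while identifying as Firefox
    wget --header="Accept: text/html" \
         --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" \
         "$link" -O "pdfs/${id}.pdf"
done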

comment:3 Changed 4 months ago by https://id.mayfirst.org/taboka

Thank you. I'll give it a go and get back to you. You're correct that the link for the list expires and it's behind a captcha. I've tried every Chrome extension known to man and have also played around with wget and curl. I'm starting to look at uGet, but don't have high hopes.

FYI, the site in question, Ecourts, is where calendar information is kept for Supreme, Civil and Housing courts in NYC (mostly Manhattan). On occasion it also has the submitted documents in PDF form for Supreme Court.

comment:4 Changed 4 months ago by https://id.mayfirst.org/taboka

Well ... partial success, and partial failure.

I created the web.html file as instructed and ran:

egrep -o 'https://iapps.courts.state.ny.us/fbem/DocumentDisplayServlet\?documentId=[a-zA-Z0-9=]+&system=prod' web.html

But the output ended up on stdout, not in a file. I tried "| web.html" and "> web.html", but neither worked (so much for my limited understanding of shell programming), so I simply copied the terminal output into the web.html file.

Then ... I tried this:

mkdir -p pdfs && for link in $(egrep -o 'https://iapps.courts.state.ny.us/fbem/DocumentDisplayServlet\?documentId=[a-zA-Z0-9=]+&system=prod' web.html); do id=$(echo "$link" | egrep -o 'documentId=[a-zA-Z0-9]+' | sed "s/documentId=//") && wget --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" "$link" -O pdfs/${id}.pdf; done

And I got successive iterations of:

--2018-06-09 19:57:54--  https://iapps.courts.state.ny.us/fbem/DocumentDisplayServlet?documentId=JXzD9bqk49AP7KVc5zzXTQ==&system=prod
Resolving iapps.courts.state.ny.us (iapps.courts.state.ny.us)... 207.29.128.73
Connecting to iapps.courts.state.ny.us (iapps.courts.state.ny.us)|207.29.128.73|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/pdf]
Saving to: ‘pdfs/JXzD9bqk49AP7KVc5zzXTQ.pdf’

pdfs/JXzD9bqk49AP7KVc5zzXTQ     [  <=>                                        ]   3.83M  13.4MB/s   in 0.3s   

2018-06-09 19:57:55 (13.4 MB/s) - ‘pdfs/JXzD9bqk49AP7KVc5zzXTQ.pdf’ saved [4016496]

But there are two issues:

  1. It only saved 60 out of 125 files. It wasn't the first 60 or the last 60; it just skipped a few along the way. Still, that's at least half.
  2. The filenames have no correlation to what they should be. I get

"JXzD9bqk49AP7KVc5zzXTQ.pdf"

instead of

"451031_2018_THE_CITY_OF_NEW_YORK_v_THE_CITY_OF_NEW_YORKSUMMONS_COMPLAINT_1.pdf"

meaning I'll need to reconstruct the file names. Not sure if that will be quicker than just downloading all 125 files by hand.

But I suppose that's better than nothing :)

I do appreciate your help on this.

BTW, I just remembered: this is from Ecourts, but there's another system, SCROLL, where one can access the same files. Maybe I'll have better success there.

Thanks

comment:5 Changed 4 months ago by https://id.mayfirst.org/taboka

Just tried the SCROLL system using the same commands, and got the same partial results.

comment:6 Changed 4 months ago by https://id.mayfirst.org/jamie

Hi taboka,

The first command is working as expected - it really was just a test to make sure the grep pattern is working. It should just print out a bunch of URLs to standard out.
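
If you do want to keep that list, redirect it into a new file rather than back into web.html ("> web.html" overwrites the very file the command is reading from). For example, using urls.txt as an arbitrary name:

egrep -o 'https://iapps.courts.state.ny.us/fbem/DocumentDisplayServlet\?documentId=[a-zA-Z0-9=]+&system=prod' web.html > urls.txt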

And, the PDF names - well, that's the best I can do without a much more complicated parsing project (the names of the PDFs are based on the document ID number, which I realize is not all that useful to you).

I'm not sure why you only get a seemingly random 60 out of the 125, though. Can you attach the web.html file you created to this ticket? I'd like to see whether the problem is that the grep command is only finding 60 of them, or whether it finds all of them but the download is failing for half of them for some reason.
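
In the meantime, two quick checks would narrow it down (assuming the same web.html and pdfs names as before):

# how many links does the pattern actually match in the saved page?
egrep -o 'https://iapps.courts.state.ny.us/fbem/DocumentDisplayServlet\?documentId=[a-zA-Z0-9=]+&system=prod' web.html | wc -l

# how many non-empty PDFs actually ended up in the pdfs directory?
find pdfs -type f -size +0c | wc -l

If the first number is around 60, the grep pattern is only matching some of the links; if it comes back as 125 but the second number is lower, the downloads themselves are failing.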

comment:7 Changed 4 months ago by https://id.mayfirst.org/jamie

  • Owner set to https://id.mayfirst.org/jamie
  • Status changed from new to assigned
