Thursday, March 24, 2011

Finding a string in Python

I want to find the URLs contained within a number of strings, probably about twenty, in a text file, which is basically a dump of the source code of a webpage.

What do I know about the strings?
  • They are all 132 characters long
  • They all begin with  <a href="http://freemusicarchive.org/music/download/
  • They all end with class="icn-arrow" title="Download"></a>
  • They all contain an alphanumeric sequence of 40 characters like this: ea1f8f96b14ef56f6eef47a3e1e74269c195f4c1

So I'm looking for about twenty strings of 132 characters that resemble this one:

<a href="http://freemusicarchive.org/music/download/ea1f8f96b14ef56f6eef47a3e1e74269c195f4c1 class="icn-arrow" title="Download"></a>
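
The facts listed above are enough to describe the string with a regular expression. Here is a minimal sketch in Python 3, not a finished solution. The names LINK_PATTERN and find_download_urls are made up for illustration. The pattern also assumes the 40-character sequence is made of letters and digits only, and it treats a closing quote after the URL as optional, so it matches the string exactly as written above as well as a tag whose href value is properly closed.

```python
import re

# Pattern built from the known facts: the fixed URL prefix, a
# 40-character alphanumeric ID, and the fixed tail of the anchor tag.
# The optional quote ("?) is an assumption, so the pattern matches the
# example string as written and also a tag with a closed href value.
LINK_PATTERN = re.compile(
    r'<a href="(http://freemusicarchive\.org/music/download/[0-9A-Za-z]{40})"?'
    r' class="icn-arrow" title="Download"></a>'
)


def find_download_urls(page_source):
    """Return every download URL found in the page source, in order."""
    return LINK_PATTERN.findall(page_source)


if __name__ == "__main__":
    sample = (
        '<li><a href="http://freemusicarchive.org/music/download/'
        'ea1f8f96b14ef56f6eef47a3e1e74269c195f4c1 class="icn-arrow" '
        'title="Download"></a></li>'
    )
    for url in find_download_urls(sample):
        print(url)
```

Because the pattern has a single capture group, findall returns just the URLs rather than the whole anchor tags, which is the part needed for the download lines.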
What do I know about the URL string?
Why am I trying to do this?

Ultimately, I want to create a BASH script that is populated with lines like this

So I'm trying to work out how to extract the URL from within the string and append it to a BASH script or text file.
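
A rough sketch of the "append it to a BASH script" step, in Python 3. The exact line format is an assumption: this uses a plain wget command, and the file name download.sh and both function names are hypothetical. It expects a list of URLs that have already been extracted.

```python
import shlex


def build_script_lines(urls, command="wget"):
    """Turn each URL into one shell command line, quoted for the shell."""
    # "wget" is an assumed default; swap in whatever command is wanted.
    return ["{} {}".format(command, shlex.quote(url)) for url in urls]


def append_to_script(urls, script_path="download.sh"):
    """Append one download line per URL to the end of the script file."""
    # Mode "a" appends, so lines written on earlier runs are kept.
    with open(script_path, "a") as script:
        for line in build_script_lines(urls):
            script.write(line + "\n")


if __name__ == "__main__":
    urls = [
        "http://freemusicarchive.org/music/download/"
        "ea1f8f96b14ef56f6eef47a3e1e74269c195f4c1"
    ]
    append_to_script(urls)
```

shlex.quote leaves a plain URL like this one untouched and only adds quotes when the string contains characters the shell would otherwise interpret.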

Why would I do such a thing?

I'd like to automate the fetching of music files from freemusicarchive.org and I'd like to further my understanding of Python.

What problems do I foresee?

How do I prevent clobbering?  I don't want to download the same couple of files every day.  My gnome-terminal -x wget trick will only take me so far.
In the next stage I'd like to incorporate some kind of log or database of the URLs so that the program will only generate these download lines if the URL in question is not found in the log or database. 
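
The simplest version of such a log is probably a plain text file with one URL per line. The sketch below (Python 3) works on that assumption. The file name downloaded_urls.log and the function names are hypothetical, and a real database could replace the text file later.

```python
import os


def load_seen_urls(log_path="downloaded_urls.log"):
    """Return the set of URLs already recorded in the log file."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path) as log:
        return {line.strip() for line in log if line.strip()}


def filter_new_urls(urls, log_path="downloaded_urls.log"):
    """Return only the URLs not yet in the log, and record them there."""
    seen = load_seen_urls(log_path)
    new_urls = []
    for url in urls:
        if url not in seen:
            seen.add(url)  # Also guards against repeats within one run.
            new_urls.append(url)
    with open(log_path, "a") as log:
        for url in new_urls:
            log.write(url + "\n")
    return new_urls
```

Note that this records a URL as soon as a download line is generated for it, not when the download actually succeeds. If a download fails, its line would have to be removed from the log by hand before the URL is offered again.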

What other ideas do I have?
  1. At the moment the files being downloaded just have alphanumeric strings as names.  However, opening them in a music player reveals that they are tagged MP3s, so it would be great to download the file and rename it according to its tags.
  2. It would also be nice to have a GUI to ask the user which genre of new music he'd like.  For myself I want to download everything, but that won't suit other users.
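
For the first idea, here is one way the renaming might be sketched in Python 3 using only the standard library. It reads the old ID3v1 tag, which is the last 128 bytes of a file and starts with the bytes "TAG". That is an assumption about these files: if they carry only ID3v2 tags, a dedicated tag-reading library would be needed instead. The function names and the "Artist - Title.mp3" naming scheme are made up for illustration.

```python
import os
import re


def read_id3v1(path):
    """Return (artist, title) from an ID3v1 tag, or None if there is none."""
    with open(path, "rb") as mp3:
        mp3.seek(0, os.SEEK_END)
        if mp3.tell() < 128:
            return None  # Too small to hold a tag.
        mp3.seek(-128, os.SEEK_END)
        tag = mp3.read(128)
    if tag[:3] != b"TAG":
        return None
    # ID3v1 layout: "TAG", then 30 bytes of title, then 30 bytes of artist.
    title = tag[3:33].split(b"\x00")[0].decode("latin-1").strip()
    artist = tag[33:63].split(b"\x00")[0].decode("latin-1").strip()
    return artist, title


def rename_by_tags(path):
    """Rename the file to 'Artist - Title.mp3'; return the resulting path."""
    tags = read_id3v1(path)
    if not tags or not all(tags):
        return path  # No usable tag: leave the file alone.
    # Replace characters that are awkward in file names.
    name = re.sub(r'[\\/:*?"<>|]', "_", "{} - {}.mp3".format(*tags))
    new_path = os.path.join(os.path.dirname(path), name)
    if os.path.exists(new_path):
        return path  # Do not overwrite an existing file.
    os.rename(path, new_path)
    return new_path
```

A file without a usable tag, or one whose new name is already taken, is left under its original name, so nothing gets overwritten.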

Alexander Garber
Director
Clockwork PC Open Source IT Solutions
"Informed decisions, real solutions"

T:  +61-3-8060-6651
M: +61-430-854-599
alex@clockworkpc.com.au
