Python: Website Thumbnail Generation

By Xah Lee. Date:

On my website there are several projects that take the form of a photo gallery. Typically, there is a set of HTML files, each with 5 to 10 inline images. I need to generate a thumbnail for each image that has an existing inline link.

For example, you can go to the following website and see the result of using this script:

To create a thumbnail image, i use the command line software ImageMagick [see ImageMagick Tutorial]. However, i only want to generate thumbnails for images that have an existing inline link. So, i'll essentially have to parse these HTML files first to find which images need a thumbnail. Also, the images are located in nested directories, not all flatly in one directory. I need the thumbnails created in a path that mirrors their file path.
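
The parsing step can be sketched with the standard library's html.parser module instead of splitting the text on “src”. This is a minimal Python 3 sketch, not the script given under Solution; the names ImgSrcCollector and inline_images are hypothetical, and it assumes remote images are the ones whose link starts with http:// or https://.

```python
# Python 3 sketch. ImgSrcCollector and inline_images are hypothetical names,
# not part of the original script.
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""
    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)

def inline_images(html_text):
    """Return the src values of local inline images, skipping remote URLs."""
    collector = ImgSrcCollector()
    collector.feed(html_text)
    collector.close()
    return [s for s in collector.sources
            if not s.startswith(('http://', 'https://'))]
```

Because only <img> tags are inspected, images that are merely linked with <a href="..."> are ignored, which matches the requirement that only inline images get a thumbnail.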

So, for example, suppose “<img src="_p/Gustave_Dore/Dore_red_ridinghood.png">” is an inline link. Then, the thumbnail for that image will be at “thumbnail/_p/Gustave_Dore/Dore_red_ridinghood.png”.
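
The path mirroring can be sketched as a small function. This is a Python 3 illustration under stated assumptions, not the code from the script: the function name thumbnail_path and its parameters are hypothetical, and it uses posixpath so that forward-slash paths behave the same on every platform.

```python
# Python 3 sketch. thumbnail_path and its parameter names are hypothetical.
import posixpath

def thumbnail_path(img_src, html_dir, root_dir, thumbnail_dir):
    """Map an inline image link to the path of its thumbnail.

    img_src       -- the link as written in the HTML file, e.g. '../a/1.png'
    html_dir      -- directory that contains the HTML file
    root_dir      -- root whose sub-directory structure is mirrored
    thumbnail_dir -- destination directory for thumbnails
    """
    # Resolve the link against the HTML file's directory, collapsing '..'.
    full = posixpath.normpath(posixpath.join(html_dir, img_src))
    # Keep only the part of the path below the root.
    relative = posixpath.relpath(full, root_dir)
    return posixpath.join(thumbnail_dir, relative)

print(thumbnail_path('_p/Gustave_Dore/Dore_red_ridinghood.png',
                     '/site', '/site', 'thumbnail'))
```

The call at the end reproduces the example above: the image's directory structure below the root is kept intact under the thumbnail directory.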

Solution

# -*- coding: utf-8 -*-
# Python

# © 2006-04 by Xah Lee, ∑ http://xahlee.org/, last mod: 2006-11

# Given a website gallery with hundreds of photos, i want to generate a thumbnail page so that viewers can get a bird's-eye view of the images.

# Technically:
# Given a dir: e.g. /Users/xah/web/Periodic_dosage_dir/lanci/
# This dir has many html files inside it, some are in sub directories.
# Each html file has many inline images (<img src="...">). (all inline images are local files)
# Print out a string with the thumbnail images as inline images, each linked to the file that contains the inline link to the original image. That is, if f.html contains inline images 1.png and 2.png, then the output should contain a line like this: <a href="f.html"><img src="thumbnail_relative_path/1.png"><img src="thumbnail_relative_path/2.png"></a>

# • The goal is to create thumbnail images of all the inline images in all html files under that dir, and print an output that can serve as the index of thumbnails.
# • The destination of the thumbnail images can be specified, unrelated to the given dir.
# • Thumbnail images must preserve the dir structure they are in. For example, if an inline image's full path is /a/b/c/d/1.img, and the root is given as /a/b, then the thumbnail image's path must retain the c/d, as sub dirs under the specified thumbnail destination.
# • if the inline image's size is smaller than a certain given size (specified as area), then skip it.


# Note: only inline images in a html file will be considered for thumbnails. Any other images in the given dir, or images that are merely linked, should be ignored.


###############################################
# User Inputs

# path where html files and images are at. e.g. /a/b/c/d
inPath= '/Users/xah/uci-server/vmm/Polyhedra' # no trailing slash

# rootDir is equal to inPath or a leading part of it. The thumbnails will preserve the dir structure relative to rootDir. If an image is at /a/b/c/d/e/f/1.png, rootDir is /a/b/c, and thumbnailDir is /x/y, then the thumbnail will be at /x/y/d/e/f/1.png
rootDir= '/Users/xah/uci-server/vmm/Polyhedra' # no trailing slash

# the destination path of thumbnail images. It will be created. Existing files will be overwritten. e.g. /x/y
thumbnailDir= '/Users/xah/uci-server/vmm/Polyhedra/tn' # no trailing slash

# thumbnail size
thumbnailSizeArea = 150 * 150

# if a image is smaller than this area, don't gen thumbnail for it.
minArea = 100*100

# if True, all thumbnails will be in jpg format. Otherwise, each thumbnail has the same format as the original.
# Black and white png files look lousy when shrunk. That's why.
jpgOnlyThumbnails = False # True or False

# paths of ImageMagick's 'identify' and 'convert' programs
identify = r'/sw/bin/identify'
convert = r'/sw/bin/convert'

# depth of nested dir to dive into.
minLevel=1; # files and dirs directly inside inPath are level 1.
maxLevel=2; # inclusive

##############################
import re, subprocess, os.path, sys

############################### 
## functions

def scaleFactor(A,(w,h)):
    '''scaleFactor(A,(w,h)) returns a number s such that w*s*h*s==A. This is used for generating the scaling factor of an image with a given desired thumbnail area A. The w and h are the width and height of the rectangle (image). The A is the desired size of the thumbnail (as area). When the image is scaled by s in both dimensions, it will have the size specified by area A.'''
    return (float(A)/float(w*h))**0.5

def getInlineImg(file_full_path):
    '''getInlineImg(html_file_full_path) returns a list of the inline image links in the file. For example, it may return ['xx.jpg','../image.png']'''
    FF = open(file_full_path,'rb')
    txt_segs = re.split( re.compile(r'src',re.U|re.I), unicode(FF.read(),'utf-8'))
#    txt_segs = re.split( r'src', unicode(FF.read(),'utf-8'))
    txt_segs.pop(0)
    FF.close()
    linx=[]
    for linkBlock in txt_segs:
        matchResult = re.search(ur'\s*=\s*\"([^\"]+)\"', linkBlock,re.U)
        if matchResult: linx.append( matchResult.group(1).encode('utf-8') )
    return linx

def linkFullPath(dir,locallink):
   '''linkFullPath(dir, locallink) returns a string that is the full path to the local link. For example, linkFullPath('/Users/t/public_html/a/b', '../image/t.png') returns '/Users/t/public_html/a/image/t.png'. The returned result will not contain a double slash or the '../' string.'''
   result = dir + '/' + locallink
   result = re.sub(r'//+', r'/', result)
   while re.search(r'/[^\/]+\/\.\.', result): result = re.sub(r'/[^\/]+\/\.\.', '', result)
   return result

def buildThumbnails(dPath, fName, tbPath, rPath, areaA):
    u'''Generate thumbnail images. dPath is a directory full path, and fName is a html file name that exists under it. The tbPath is the thumbnail images destination dir. The areaA is the thumbnail image size in terms of its area. This function will create thumbnail images in the tbPath. rPath is a root dir that is a leading part of dPath (or equal to it), used to build the dir structure under tbPath for each thumbnail.

For Example, if
dPath = '/Users/mary/Public/pictures'
fName = 'trip.html' (this exists under dPath)
tbPath = '/Users/mary/Public/thumbs'
rPath = '/Users/mary/Public' (must be a substring of dPath or equal to it.)
and trip.html contains <img src="Beijin/day1/img1.jpg">
then a thumbnail will be generated at
'/Users/mary/Public/thumbs/pictures/Beijin/day1/img1.jpg'

This func uses ImageMagick's shell commands “convert” and “identify”, and assumes that the paths of both on the disk are set in the global vars “convert” and “identify”.'''
    # outline:
    # • Read in the file.
    # • Get the img paths from inline images tags, accumulate them into a list.
    # • For each image, find its dimension w and h.
    # • Generate the thumbnail image on disk.

    # Generate a list of image paths. Each element of imgPaths is a full path to a image.
    imgPaths=[]
    for im in getInlineImg(dPath + '/' + fName):
        # skip remote images and the icon_sum.gif icon
        if im.startswith('http') or im.endswith('icon_sum.gif'): continue
        imgPaths.append(linkFullPath(dPath, im))
#    print dPath, fName, tbPath, rPath
#    print imgPaths

    # generate the imgPaths2 list. (Change the image path to the full sized image, if it exists. That is, if image ends in -s.jpg, find one without the '-s'.)
    imgPaths2=[]
    for oldPath in imgPaths:
        newPath=oldPath
        (dirName, fileName) = os.path.split(oldPath)
        (fileBaseName, fileExtension)=os.path.splitext(fileName)
        if(re.search(r'-s$',fileBaseName,re.U)):
            p2=os.path.join(dirName,fileBaseName[0:-2]) + fileExtension
            if os.path.exists(p2): newPath=p2
        imgPaths2.append(newPath)

    # generate the imgData list. Each element in imgData has the form [image full path, [width, height]].
    imgData=[]
    for imp in imgPaths2:
        # sample output of “identify”; the third field holds the dimensions:
        # DSCN2699m-s.JPG JPEG 307x230+0+0 DirectClass 8-bit 51.7k 0.0u 0:01
        # print "Identifying:", imp
        imgInfo=subprocess.Popen([identify, imp], stdout=subprocess.PIPE).communicate()[0]
        (width,height)=(imgInfo.split()[2]).split('x')
        height=height.split('+')[0]
        if int(width)*int(height) > minArea: imgData.append( [imp, [int(width), int(height)]])

    linkPath=(dPath+'/'+fName)[ len(rPath) + 1:]
    sys.stdout.write('<a href="' + linkPath + '">')

    # create the scaled image files in thumbnail dir. The dir structure is replicated.
    for imp in imgData:
        #print "Thumbnailing:", imp
        oriImgFullPath=imp[0]
        thumbnailRelativePath = oriImgFullPath[ len(rPath) + 1:]
        thumbnailFullPath = tbPath + '/' + thumbnailRelativePath

        if jpgOnlyThumbnails:
            (b,e)=os.path.splitext(thumbnailRelativePath)
            thumbnailRelativePath=b + '.jpg'
            (b,e)=os.path.splitext(thumbnailFullPath)
            thumbnailFullPath=b + '.jpg'
        #print 'r',thumbnailRelativePath
        #print 'f',thumbnailFullPath
        sf=scaleFactor(areaA,(imp[1][0],imp[1][1]))

        sys.stdout.write('<img src="' + thumbnailFullPath + '" alt="">')

        # make dirs to the thumbnail dir
        (dirName, fileName) = os.path.split(thumbnailFullPath)
        (fileBaseName, fileExtension)=os.path.splitext(fileName)
        #print "Creating thumbnail:", thumbnailFullPath
        try:
            os.makedirs(dirName,0775)
        except(OSError):
            pass

        # create thumbnail. Each option is passed as its own argument, so that '-sharpen 1' is not swallowed into the value of -scale.
#        print 'convert' +  ' -scale ' +str(round(sf*100,2)) + '% ' + oriImgFullPath + ' ' + thumbnailFullPath
        subprocess.Popen([convert, '-scale', str(round(sf*100,2)) + '%', '-sharpen', '1', oriImgFullPath, thumbnailFullPath] ).wait()

    print '</a>'

#################
# main

def dirHandler(dummy, curdir, filess):
    curdirLevel=len(re.split('/',curdir))-len(re.split('/',inPath))
    filessLevel=curdirLevel+1
    if minLevel <= filessLevel <= maxLevel:
        for child in filess:
            if re.search(r'\.html$',child,re.U) and os.path.isfile(curdir+'/'+child):
                print "processing:", curdir+'/'+child
                buildThumbnails(curdir,child,thumbnailDir,rootDir,thumbnailSizeArea)

while inPath[-1] == '/': inPath = inPath[0:-1] # get rid of trailing slash
os.path.walk(inPath, dirHandler, 'dummy')
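
The scaling arithmetic in scaleFactor can be checked in isolation. The following is a small Python 3 sketch of the same formula, s = sqrt(A / (w*h)); the names scale_factor, scale_percent and thumbnail_size are hypothetical helpers for illustration, and the 640×480 image is an assumed example, not one from the gallery.

```python
# Python 3 sketch of the scaling arithmetic. All function names here are
# hypothetical; only the formula comes from the script above.
import math

def scale_factor(area, width, height):
    """Return s such that (width*s) * (height*s) == area."""
    return math.sqrt(area / (width * height))

def scale_percent(area, width, height):
    """Format the scale factor as a percentage string, such as '50.00%'."""
    return '%.2f%%' % (scale_factor(area, width, height) * 100)

def thumbnail_size(area, width, height):
    """Return the (width, height) in pixels after scaling, rounded."""
    s = scale_factor(area, width, height)
    return (round(width * s), round(height * s))

# Assumed example: a 640x480 photo scaled to a 150*150 pixel area.
print(scale_percent(150 * 150, 640, 480))
print(thumbnail_size(150 * 150, 640, 480))
```

Specifying the thumbnail size as an area rather than a fixed width or height keeps the aspect ratio of each image, while wide and tall images still end up with about the same number of pixels.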