Python case study: leetcode scraper
It has been many years since I last touched Python, and things have gotten rusty. Recently, I have been practicing my algorithm skills on leetcode, and I keep all my solutions in a github repo. I want my source files to follow the consistent format shown below
/*
* [Source]
*
* https://leetcode.com/problems/same-tree/
*
* [Problem Description]
*
* Given two binary trees, write a function to check if they are equal or not.
*
* Two binary trees are considered equal if they are structurally identical
* and the nodes have the same value.
*
* [Companies]
*/
// Source code begins here ...
The overhead of adding this header comment by hand can be quite large. So, I asked myself whether there is a way to automate the whole process as much as possible. Python and its famous beautifulsoup library 1 immediately came to mind.
In this post, I'll highlight some Python usage that appears in the script and cost me quite some time on Google. Please leave a comment if you find any non-pythonic usage. The script is available here. I'll use the 92. Reverse Linked List II leetcode page as a working example to demonstrate the Python techniques.
#!/usr/bin/env python3.6
# -*- coding: utf-8 -*-
The very first thing is the "shebang" line together with the encoding declaration. This matters for our task because the web page often contains unicode (e.g. mathematical symbols), and declaring utf-8 helps us avoid unicode & ascii madness.
from bs4 import BeautifulSoup
import requests
import sys
We use a lot of libraries through import. If I use import module, I have to use a qualifier for any module function call (i.e. sys.exit()). By contrast, I can call the module function directly if I do from module import. This raises the question of when to use which. Here, I want to quote the explanation from Dive Into Python:
When should you use from module import?
- If you will be accessing attributes and methods often and don't want to type the module name over and over, use "from module import".
- If you want to selectively import some attributes and methods but not others, use "from module import".
- If the module contains attributes or functions with the same name as ones in your module, you must use "import module" to avoid name conflicts.
The author makes an extra remark: use from module import * sparingly, because it makes it difficult to determine where a particular function or attribute came from, and that makes debugging and refactoring more difficult.
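The two import styles above can be put side by side in a tiny sketch (using math purely as a stand-in module):

```python
# qualified access: the module name is required at every call site
import math
print(math.sqrt(16.0))   # 4.0

# direct access: the name is pulled into our own namespace
from math import sqrt
print(sqrt(16.0))        # 4.0
```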
script, url = sys.argv
print('url is {:s}'.format(url))
I used to really like Python 2.7 and was not a big fan of Python 3. However, with Python 2.7 reaching end of life, the change had to be made. The print() function combined with str.format() is how we do formatted printing in Python 3.
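One caveat with the bare tuple unpacking of sys.argv: it raises a ValueError when the argument count is wrong. A small guard makes the failure friendlier; this is a sketch with a made-up usage message, not part of the original script:

```python
def parse_args(argv):
    # argv mimics sys.argv: [script name, url]
    if len(argv) != 2:
        raise SystemExit('usage: {} <leetcode-url>'.format(argv[0]))
    script, url = argv  # the same tuple unpacking the scraper uses
    return url

url = parse_args(['scraper.py',
                  'https://leetcode.com/problems/reverse-linked-list-ii/'])
print('url is {:s}'.format(url))
```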
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
Here, I use the requests library to fetch the content of the url and then feed it into BeautifulSoup with the lxml parser.
The next step is to actually scrape the data from the leetcode page. The first thing I do is get the question title. The leetcode page has the following structure for the question title
<div class="question-title clearfix">
<div class="row">
<div class="col-lg-4 col-md-5 col-sm-6 col-sm-push-6 col-md-push-7 col-lg-push-8" id="widgets">
<div class="like-and-dislike">
<div id="question-like"></div>
</div>
<div class="add-to-list">
<div id="add-to-favorite"></div>
</div>
</div>
<div class="col-lg-8 col-md-7 col-sm-6 col-sm-pull-6 col-md-pull-5 col-lg-pull-4">
<h3>
92. Reverse Linked List II
</h3>
</div>
</div>
</div>
As you can see, the question title ("92. Reverse Linked List II") is wrapped by the <div> tag with class name question-title.
title_corp = soup.find_all("div", class_="question-title")
title_raw = title_corp[0].h3.get_text()
So, we invoke the find_all method from beautiful soup to find all the <div></div> tags with class name question-title. Fortunately, the question-title class appears only once in the whole html page. That allows us to access it directly with title_corp[0]. In addition, as you can see from the html source code above, <h3></h3> appears only once and it wraps our problem title. So, we can directly access the content of the <h3></h3> tag with title_corp[0].h3.get_text().

Note
find_all returns a "ResultSet" object in beautifulSoup. This object contains the set of tags matching the criteria given in the find_all arguments. In our case, the criteria is a <div></div> tag with class name question-title.
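We can check this step end to end without hitting the network by feeding a trimmed-down version of the HTML snippet above straight into BeautifulSoup. I use the stdlib html.parser here instead of lxml so no extra parser dependency is needed:

```python
from bs4 import BeautifulSoup

html = """
<div class="question-title clearfix">
  <div class="row">
    <div class="col-lg-8">
      <h3>
        92. Reverse Linked List II
      </h3>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
title_corp = soup.find_all("div", class_="question-title")
print(len(title_corp))   # 1 -- the class appears only once
title_raw = title_corp[0].h3.get_text()
print(repr(title_raw))   # the title, still padded with whitespace and newlines
```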
Now, once we have the title string, we want to process it into our desired form. Our scraper script will go into the leetcode directory of the shuati repo and create the question directory in the format "[question number]-[question title in mixed case with the first letter of each internal word capitalized]". For example, "92. Reverse Linked List II" will lead to a directory ./leetcode/92-ReverseLinkedListII. The source file name is similar to the directory name: reverseLinkedListII.c. That's what the following code chunk tries to achieve
import operator
import os

title_lines = title_raw.split('\n')
title_lines = list(filter(operator.methodcaller('strip'), title_lines))
title_rdy = title_lines[0].lstrip(' ').replace(".", "-").split(' ')
title = "".join(title_rdy)
path = "./leetcode/" + title
os.mkdir(path)
title_lines = title_raw.split('\n') splits the whole text into a list of strings, each string being one line. In our case, this gives ['', ' 92. Reverse Linked List II', ' '].

As you can see, our result contains an empty string, a string with leading whitespace, and a string with only whitespace. We need to do some cleanup to keep only the question title. The first step is to take out the empty string and the whitespace-only string. This is done by title_lines = list(filter(operator.methodcaller('strip'), title_lines)) 2. filter keeps the elements for which a function (the 1st argument of filter) returns a truthy value; in Python 3 it returns an iterator, hence the list() wrapper. operator.methodcaller('strip') uses methodcaller, which calls the strip method on each element of title_lines. The result of strip is a non-empty (truthy) string only when the original string contains some non-whitespace characters. This leaves us with [' 92. Reverse Linked List II'].
Note
Here is an example of methodcaller: after f = methodcaller('name', 'foo', bar=1), the call f(b) returns b.name('foo', bar=1). In our case, filter applies operator.methodcaller('strip') to each element of title_lines, which is basically element.strip() for each element.
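The filter/methodcaller combination is easy to verify in isolation; an equivalent list comprehension with str.strip does the same job:

```python
import operator

title_lines = ['', '    92. Reverse Linked List II', '   ']

# keep only lines whose strip() result is non-empty (truthy)
kept = list(filter(operator.methodcaller('strip'), title_lines))
print(kept)   # ['    92. Reverse Linked List II']

# equivalent, arguably more common, list comprehension
kept2 = [line for line in title_lines if line.strip()]
assert kept == kept2
```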
Now, we work on our title string. title_rdy = title_lines[0].lstrip(' ').replace(".", "-").split(' ') removes the leading whitespace (lstrip(' ')), replaces . with -, and then splits our string into words: ['92-', 'Reverse', 'Linked', 'List', 'II']. We are then ready to form our directory name by joining the words together (title = "".join(title_rdy)), which gives 92-ReverseLinkedListII.
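Running the whole chain on our example title confirms each intermediate result:

```python
title_line = '  92. Reverse Linked List II'

# strip leading spaces, turn '.' into '-', then split on spaces
title_rdy = title_line.lstrip(' ').replace(".", "-").split(' ')
print(title_rdy)   # ['92-', 'Reverse', 'Linked', 'List', 'II']

title = "".join(title_rdy)
print(title)       # 92-ReverseLinkedListII
```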
Our file name should look like reverseLinkedListII.c. This involves a use of regular expressions to get rid of 92- and to convert the first character of the remaining string to lower case. The code is below
import re

extension = ".c"
pat = re.compile(r"^(\d+)-")
m = re.search(pat, title)
filename = title[:m.start()] + title[m.end():]
filename = filename[0].lower() + filename[1:]
target = open(path + "/" + filename + extension, "w")
The regular expression technique is best illustrated by a snippet taken from the re library documentation
>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'
^ matches the beginning of the string, \d means a numeric digit, and + means at least one appearance (of \d). Just like the official doc snippet above, filename = title[:m.start()] + title[m.end():] removes, for instance, 92- and leaves us ReverseLinkedListII 3. One thing to notice is that our filename has type str, which is immutable. This means we cannot edit the variable in place. filename = filename[0].lower() + filename[1:] is a typical way to handle an immutable str object: in our case, it lowercases the first character and prepends it back to the rest of the string.
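The filename derivation can be exercised on its own:

```python
import re

title = '92-ReverseLinkedListII'

pat = re.compile(r"^(\d+)-")   # leading digits followed by a dash
m = re.search(pat, title)
filename = title[:m.start()] + title[m.end():]   # drop '92-'
filename = filename[0].lower() + filename[1:]    # lowercase first letter
print(filename)   # reverseLinkedListII
```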
The last point to notice is line = line.replace("\r", "").replace("\n", ""), which removes the carriage return character (^M) and the linux newline character.
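Checked on a sample line (the text is made up for illustration):

```python
line = 'Given two binary trees,\r\n'
line = line.replace("\r", "").replace("\n", "")
print(repr(line))   # 'Given two binary trees,'
```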
That's it for the leetcode scraper. This is actually the first scraper I have ever written, and it was not as hard as I imagined. I think that's mainly because of the powerful Python language and its libraries.
1. Here is a good tutorial on beautifulSoup. ↩
2. This line is found in this SO post. ↩
3. I did a quick summary of regular expressions in python. ↩