A high-performance scraper built for an Upwork job that was awarded to someone else...
I came across an Upwork job posting looking for a freelancer to scrape all inmate information from the Alabama Department of Corrections website. Rather than submitting a proposal with vague promises, I decided to solve the entire problem upfront.
I built and ran this scraper and generated the complete dataset the client requested. I attached the final Excel file to my proposal, effectively delivering the finished product before even being hired.
The client ended up hiring someone else (perhaps they never saw my application), so I open-sourced it.
The key to this approach was bypassing slow browser automation (like Selenium or Playwright) and instead reverse-engineering the website's underlying ASP.NET HTTP requests. This resulted in a massive performance gain.
| Method | Estimated Time to Complete | Actual Time to Complete |
|---|---|---|
| Browser Automation (e.g., Playwright) | ~15+ Hours | N/A |
| Direct HTTP Requests (This Repo) | ~30 Minutes | < 30 Minutes |
The initial collection of all 27,000+ unique inmate ID numbers (AIS #) completes in about 10 seconds.
The scraper is designed in two phases.
The first phase, handled by `collect_ais.py`, is responsible for finding every unique inmate AIS number on the site.
- It iterates through each letter of the alphabet (`a` to `z`) as a search term for the "Last Name" field.
- For each letter, it launches a parallel worker thread (up to 26 workers).
- Each worker performs the initial search and then rapidly paginates through all result pages, collecting every AIS number it finds.
- All unique AIS numbers are consolidated and saved to a checkpoint file: `ais_numbers_checkpoint.txt`.
This phase is fast, typically finishing in under 15 seconds.
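To make the fan-out concrete, here is a minimal sketch of the phase-1 pattern. The `collect_ais_for_letter` helper is a hypothetical stand-in for the search-and-paginate logic in `collect_ais.py`, not the script's actual function:

```python
import string
from concurrent.futures import ThreadPoolExecutor

def collect_ais_for_letter(letter: str) -> set[str]:
    """Search the 'Last Name' field for one letter, paginate, collect AIS numbers."""
    found: set[str] = set()
    # ...POST the search for `letter`, walk every results page,
    # and add each AIS number found along the way...
    return found

# One worker per letter, up to 26 in flight at once.
with ThreadPoolExecutor(max_workers=26) as pool:
    per_letter = pool.map(collect_ais_for_letter, string.ascii_lowercase)
    all_ais = set().union(*per_letter)

# Consolidate into the checkpoint file, one AIS number per line.
with open("ais_numbers_checkpoint.txt", "w") as f:
    f.write("\n".join(sorted(all_ais)))
```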
The second phase, handled by `process_ais.py`, reads the AIS numbers from the checkpoint file and scrapes the detailed profile for each one.
- It loads the list of AIS numbers to be processed.
- It checks for an existing output CSV and resumes from where it left off.
- It uses a `ThreadPoolExecutor` to process up to 50 inmates concurrently.
- For each AIS number, a worker sends a direct POST request to fetch the inmate's details page.
- The HTML is parsed with BeautifulSoup to extract all relevant information, including demographics, incarceration history, sentences, aliases, and tattoos.
- The data is appended directly to the final CSV file, `alabama_inmates_database.csv`.
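The resume-and-append loop looks roughly like the sketch below. The stubbed `fetch_inmate` stands in for the real POST-and-parse worker, and the single `ais` column is illustrative; the actual script writes many more fields:

```python
import csv
from concurrent.futures import ThreadPoolExecutor, as_completed

OUTPUT = "alabama_inmates_database.csv"

def fetch_inmate(ais: str) -> dict:
    """Stub: the real worker POSTs for the details page and parses it with BeautifulSoup."""
    return {"ais": ais}

def already_done() -> set[str]:
    """AIS numbers already present in the output CSV, so interrupted runs can resume."""
    try:
        with open(OUTPUT, newline="") as f:
            return {row["ais"] for row in csv.DictReader(f)}
    except FileNotFoundError:
        return set()

ais_numbers = open("ais_numbers_checkpoint.txt").read().split()
done = already_done()
todo = [a for a in ais_numbers if a not in done]

with open(OUTPUT, "a", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["ais"])
    if not done:
        writer.writeheader()
    with ThreadPoolExecutor(max_workers=50) as pool:
        futures = [pool.submit(fetch_inmate, a) for a in todo]
        for future in as_completed(futures):
            writer.writerow(future.result())  # append as results arrive
```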
The Alabama DOC website is built on ASP.NET, which uses a "ViewState" mechanism to maintain state between requests. A browser normally handles this automatically, but to do it with requests, I had to mimic that behavior.
Here was the process:
- Inspect Network Traffic: Using browser developer tools, I monitored the network requests made when performing a search.
- Identify Key Payloads: I discovered that any POST request (like a search or clicking "next page") required several hidden form inputs to be sent back to the server. These are `__VIEWSTATE`, `__VIEWSTATEGENERATOR`, and `__EVENTVALIDATION`.
- Mimic the Browser Flow: The scripts replicate the browser's actions (see the sketch below):
  - First, send a GET request to the search page to get an initial, valid set of `__VIEWSTATE` tokens.
  - Extract these values from the HTML using BeautifulSoup.
  - Construct a `data` payload for a POST request that includes these tokens along with our search term (e.g., `ctl00$MainContent$txtLName: 'a'`).
  - For pagination or clicking on a record, the process is the same, but we also include an `__EVENTTARGET` value that tells the server which button was "clicked".
This direct-request method is orders of magnitude faster and uses significantly fewer resources than loading a full browser for each request.
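Put together, the round-trip looks like this sketch. The search URL is a placeholder rather than the site's actual endpoint, and the `__EVENTTARGET` control name is illustrative; the token field names are standard ASP.NET:

```python
import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://example.gov/InmateSearch.aspx"  # placeholder, not the real endpoint

session = requests.Session()

# Step 1: GET the search page to receive an initial, valid set of tokens.
soup = BeautifulSoup(session.get(SEARCH_URL).text, "lxml")

def hidden(name: str) -> str:
    """Step 2: extract a hidden form input's value from the page."""
    tag = soup.find("input", {"name": name})
    return tag.get("value", "") if tag else ""

# Step 3: echo the tokens back in the POST, along with the search term.
payload = {
    "__VIEWSTATE": hidden("__VIEWSTATE"),
    "__VIEWSTATEGENERATOR": hidden("__VIEWSTATEGENERATOR"),
    "__EVENTVALIDATION": hidden("__EVENTVALIDATION"),
    "ctl00$MainContent$txtLName": "a",
}
results = session.post(SEARCH_URL, data=payload)

# Step 4: for pagination or opening a record, also name the control that
# was "clicked", e.g. (control ID is illustrative):
# payload["__EVENTTARGET"] = "ctl00$MainContent$gvResults"
```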
- Python 3.13
- `pip` for installing packages
```bash
git clone https://github.com/vekoada/alabama-doc-scraper.git
cd alabama-doc-scraper
```

Create and activate a virtual environment:

```bash
# For Unix/macOS
python3 -m venv venv
source venv/bin/activate

# For Windows
python -m venv venv
.\venv\Scripts\activate
```

Then install the dependencies:

```bash
pip install pandas beautifulsoup4 requests lxml
```

The process is a simple two-step execution.
Step 1: Collect all inmate AIS numbers.

```bash
python collect_ais.py
```

This will create `ais_numbers_checkpoint.txt`.
Step 2: Process the AIS numbers to get inmate details.

```bash
python process_ais.py
```

This will read the checkpoint file and create `alabama_inmates_database.csv`. You can run this command again to resume if it gets interrupted.
- `ais_numbers_checkpoint.txt`: A text file containing one unique inmate AIS number per line.
- `alabama_inmates_database.csv`: The final dataset containing all scraped information.
This project is licensed under the MIT License.