Title: Web Scraping
Date: 2021-11-12 23:43
Category: Projects

Web Scraping

Introduction & Purpose

This post describes my web scraping project using BeautifulSoup. The goal was to automate scraping Wikipedia for the articles on all of the Presidents of the United States, run a basic sentiment analysis on each article, and report the results.

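Note: the script processes 44 profiles. Grover Cleveland served two non-consecutive terms (as the 22nd and 24th President), so the 46 presidencies through Joe Biden correspond to 45 people; the list below also omits Benjamin Harrison (the 23rd President), which brings the count to 44.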

Here is the Python code:

import requests, json, re, unicodedata, operator
from bs4 import BeautifulSoup
from datetime import datetime
from nltk.sentiment import SentimentIntensityAnalyzer
from statistics import mean

start_time = datetime.now()
print("start time: ", start_time.strftime("%H:%M:%S"))

presidents_list = ['George_Washington','John_Adams','Thomas_Jefferson','James_Madison','James_Monroe','John_Quincy_Adams',
'Andrew_Jackson','Martin_Van_Buren','William_Henry_Harrison','John_Tyler','James_K._Polk','Zachary_Taylor','Millard_Fillmore',
'Franklin_Pierce','James_Buchanan','Abraham_Lincoln','Andrew_Johnson','Ulysses_S._Grant','Rutherford_B._Hayes','James_A._Garfield',
'Chester_A._Arthur','Grover_Cleveland','William_McKinley','Theodore_Roosevelt','William_Howard_Taft','Woodrow_Wilson','Warren_G._Harding',
'Calvin_Coolidge','Herbert_Hoover','Franklin_D._Roosevelt','Harry_S._Truman','Dwight_D._Eisenhower','John_F._Kennedy','Lyndon_B._Johnson',
'Richard_Nixon','Gerald_Ford','Jimmy_Carter','Ronald_Reagan','George_H._W._Bush','Bill_Clinton','George_W._Bush','Barack_Obama',
'Donald_Trump','Joe_Biden']

json_path = 'output/jsons/'
text_path = 'output/texts/'

#--- Extract president profiles from wikipedia
extract_count = 0
for president in presidents_list:
    print('... extracting: ' + president)
    url = "https://en.wikipedia.org/wiki/" + president
    json_output = json_path + president + '.json'
    text_output = text_path + president + '.txt'

    # Get the raw HTML for the page; fail early on network or HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    html_content = response.text

    # Parse the HTML content, extract text, and delete numeric references (e.g., "[1]") from the text.
    # "lxml" can be used instead of "html.parser" if it is installed.
    soup = BeautifulSoup(html_content, "html.parser")
    wiki_page_text = ""
    regex_pattern = r"\[\d+\]"
    for text_block in soup.find_all("p"):
        wiki_page_text += re.sub(regex_pattern, r"", text_block.text)

    with open(json_output, 'w') as f:
        json.dump(soup.prettify(),f)
    with open(text_output, 'w') as ft:
        ft.write(unicodedata.normalize('NFKD', wiki_page_text).encode('ascii', 'ignore').decode())
    extract_count += 1
print(f'{extract_count} extracts completed')

#--- Run basic sentiment analysis on extracts using VADER (Valence Aware Dictionary and Sentiment Reasoner).
sia = SentimentIntensityAnalyzer()
presidents_all_scores = {}

for president in presidents_list:
    with open(text_path + president + '.txt') as f:
        president_texts = f.readlines()
    president_scores = [sia.polarity_scores(section)["compound"] for section in president_texts]
    presidents_all_scores[president] = mean(president_scores)

print('\nPresident compound scores unsorted: ', presidents_all_scores)
sorted_scores = dict(sorted(presidents_all_scores.items(), key=operator.itemgetter(1), reverse=True))
longest_name = max(len(name) for name in presidents_list)
print('\nDictionary in descending order by score: ')
for rank, (key, value) in enumerate(sorted_scores.items(), start=1):
    print(f'{rank:2}| {key:{longest_name}} : {value}')
with open('president_sentiment_scores.json', 'w') as f:
    json.dump(sorted_scores, f)

end_time = datetime.now()
print("\nend time: ", end_time.strftime("%H:%M:%S"))
print("run time: ", end_time - start_time)
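
The text-cleaning step in the script above can also be pulled out into a small helper, which makes it easy to check in isolation. This is only a minimal sketch and not part of the original script: the name `clean_wiki_text` is hypothetical, and it simply reuses the same regex and NFKD normalization shown above.

```python
import re
import unicodedata

# Same pattern as in the script: numeric citation markers such as "[1]" or "[23]".
CITATION_PATTERN = re.compile(r"\[\d+\]")


def clean_wiki_text(text):
    """Remove numeric citation markers and reduce the text to plain ASCII.

    Hypothetical helper for illustration; it mirrors the cleaning that the
    script does inline.
    """
    without_citations = CITATION_PATTERN.sub("", text)
    # NFKD splits accented characters into a base letter plus combining marks;
    # encoding to ASCII with errors="ignore" then drops the marks.
    decomposed = unicodedata.normalize("NFKD", without_citations)
    return decomposed.encode("ascii", "ignore").decode()


if __name__ == "__main__":
    print(clean_wiki_text("Born in 1732.[1] He visited a café.[23]"))
```

Note that the pattern only matches numeric markers, so footnotes such as "[a]" are left in the text.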

Here is the output of a run of wiki_presidents.py:

(base) C:\Users\6560\Desktop\My Online Portfolio\web scraping>python wiki_presidents.py
start time:  20:03:30
... extracting: George_Washington
... extracting: John_Adams
... extracting: Thomas_Jefferson
... extracting: James_Madison
... extracting: James_Monroe
... extracting: John_Quincy_Adams
... extracting: Andrew_Jackson
... extracting: Martin_Van_Buren
... extracting: William_Henry_Harrison
... extracting: John_Tyler
... extracting: James_K._Polk
... extracting: Zachary_Taylor
... extracting: Millard_Fillmore
... extracting: Franklin_Pierce
... extracting: James_Buchanan
... extracting: Abraham_Lincoln
... extracting: Andrew_Johnson
... extracting: Ulysses_S._Grant
... extracting: Rutherford_B._Hayes
... extracting: James_A._Garfield
... extracting: Chester_A._Arthur
... extracting: Grover_Cleveland
... extracting: William_McKinley
... extracting: Theodore_Roosevelt
... extracting: William_Howard_Taft
... extracting: Woodrow_Wilson
... extracting: Warren_G._Harding
... extracting: Calvin_Coolidge
... extracting: Herbert_Hoover
... extracting: Franklin_D._Roosevelt
... extracting: Harry_S._Truman
... extracting: Dwight_D._Eisenhower
... extracting: John_F._Kennedy
... extracting: Lyndon_B._Johnson
... extracting: Richard_Nixon
... extracting: Gerald_Ford
... extracting: Jimmy_Carter
... extracting: Ronald_Reagan
... extracting: George_H._W._Bush
... extracting: Bill_Clinton
... extracting: George_W._Bush
... extracting: Barack_Obama
... extracting: Donald_Trump
... extracting: Joe_Biden
44 extracts completed

President compound scores unsorted:  {'George_Washington': -0.05613117647058823, 'John_Adams': 0.1264099173553719, 'Thomas_Jefferson': 0.10735350318471337, 'James_Madison': 0.1077625, 'James_Monroe': 0.03077260273972603, 'John_Quincy_Adams': 0.3375173913043478, 'Andrew_Jackson': -0.07889677419354839, 'Martin_Van_Buren': 0.15705, 'William_Henry_Harrison': -0.05758481012658228, 'John_Tyler': 0.08182702702702703, 'James_K._Polk': 0.16884876033057852, 'Zachary_Taylor': -0.13828045977011494, 'Millard_Fillmore': 0.18594105263157895, 'Franklin_Pierce': 0.059174712643678164, 'James_Buchanan': 0.05746621621621622, 'Abraham_Lincoln': 0.007139455782312922, 'Andrew_Johnson': -0.021845283018867925, 'Ulysses_S._Grant': 0.2470170068027211, 'Rutherford_B._Hayes': 0.10601111111111111, 'James_A._Garfield': 0.20889684210526316, 'Chester_A._Arthur': 0.20352727272727272, 'Grover_Cleveland': 0.1795375, 'William_McKinley': 0.14341574074074073, 'Theodore_Roosevelt': 0.24363509933774835, 'William_Howard_Taft': 0.22496048387096773, 'Woodrow_Wilson': 0.17885, 'Warren_G._Harding': 0.16148582089552238, 'Calvin_Coolidge': 0.3078324324324324, 'Herbert_Hoover': 0.14755238095238096, 'Franklin_D._Roosevelt': 0.15459561403508773, 'Harry_S._Truman': 0.054730718954248365, 'Dwight_D._Eisenhower': 0.10567134502923976, 'John_F._Kennedy': 0.16221166666666667, 'Lyndon_B._Johnson': 0.031903571428571434, 'Richard_Nixon': 0.10814264705882352, 'Gerald_Ford': 0.20233407407407408, 'Jimmy_Carter': 0.1316712, 'Ronald_Reagan': 0.11568375, 'George_H._W._Bush': 0.26213736263736265, 'Bill_Clinton': 0.1240581560283688, 'George_W._Bush': 0.023395731707317072, 'Barack_Obama': 0.26406638655462183, 'Donald_Trump': -0.08687678571428571, 'Joe_Biden': 0.08967674418604651}

Dictionary in descending order by score:
 1| John_Quincy_Adams      : 0.3375173913043478
 2| Calvin_Coolidge        : 0.3078324324324324
 3| Barack_Obama           : 0.26406638655462183
 4| George_H._W._Bush      : 0.26213736263736265
 5| Ulysses_S._Grant       : 0.2470170068027211
 6| Theodore_Roosevelt     : 0.24363509933774835
 7| William_Howard_Taft    : 0.22496048387096773
 8| James_A._Garfield      : 0.20889684210526316
 9| Chester_A._Arthur      : 0.20352727272727272
10| Gerald_Ford            : 0.20233407407407408
11| Millard_Fillmore       : 0.18594105263157895
12| Grover_Cleveland       : 0.1795375
13| Woodrow_Wilson         : 0.17885
14| James_K._Polk          : 0.16884876033057852
15| John_F._Kennedy        : 0.16221166666666667
16| Warren_G._Harding      : 0.16148582089552238
17| Martin_Van_Buren       : 0.15705
18| Franklin_D._Roosevelt  : 0.15459561403508773
19| Herbert_Hoover         : 0.14755238095238096
20| William_McKinley       : 0.14341574074074073
21| Jimmy_Carter           : 0.1316712
22| John_Adams             : 0.1264099173553719
23| Bill_Clinton           : 0.1240581560283688
24| Ronald_Reagan          : 0.11568375
25| Richard_Nixon          : 0.10814264705882352
26| James_Madison          : 0.1077625
27| Thomas_Jefferson       : 0.10735350318471337
28| Rutherford_B._Hayes    : 0.10601111111111111
29| Dwight_D._Eisenhower   : 0.10567134502923976
30| Joe_Biden              : 0.08967674418604651
31| John_Tyler             : 0.08182702702702703
32| Franklin_Pierce        : 0.059174712643678164
33| James_Buchanan         : 0.05746621621621622
34| Harry_S._Truman        : 0.054730718954248365
35| Lyndon_B._Johnson      : 0.031903571428571434
36| James_Monroe           : 0.03077260273972603
37| George_W._Bush         : 0.023395731707317072
38| Abraham_Lincoln        : 0.007139455782312922
39| Andrew_Johnson         : -0.021845283018867925
40| George_Washington      : -0.05613117647058823
41| William_Henry_Harrison : -0.05758481012658228
42| Andrew_Jackson         : -0.07889677419354839
43| Donald_Trump           : -0.08687678571428571
44| Zachary_Taylor         : -0.13828045977011494

end time:  20:04:57
run time:  0:01:27.511238

Discussion

I was mainly interested in trying out web scraping and did not want to spend much time on the sentiment analyser, so I made the pragmatic decision to use VADER (Valence Aware Dictionary and Sentiment Reasoner). "Out of the box", VADER is tuned for social media text such as Twitter messages, and so works best with shorter text. Even so, it provided a general estimate of how favorable each profile is with just a few lines of code:

from nltk.sentiment import SentimentIntensityAnalyzer
...
sia = SentimentIntensityAnalyzer()
...
president_scores = [sia.polarity_scores(section)["compound"] for section in president_texts]
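
The averaging step can be tried without NLTK installed if the scorer is passed in as a plain function. The following is a minimal sketch under that assumption: `mean_compound` and `toy_score` are hypothetical names, the toy scorer is only a stand-in, and skipping blank lines is a design choice of this sketch rather than something the script above does.

```python
from statistics import mean


def mean_compound(lines, score_fn):
    """Average score_fn over the non-blank lines; return 0.0 if none remain.

    score_fn is any callable mapping a string to a number. With NLTK it
    could be: lambda s: sia.polarity_scores(s)["compound"]
    """
    scores = [score_fn(line) for line in lines if line.strip()]
    return mean(scores) if scores else 0.0


def toy_score(text):
    """Stand-in scorer for demonstration only; this is NOT how VADER works."""
    lowered = text.lower()
    if "good" in lowered:
        return 0.5
    if "bad" in lowered:
        return -0.5
    return 0.0


if __name__ == "__main__":
    sample = ["A good year.\n", "\n", "A bad war.\n", "A good peace.\n"]
    print(mean_compound(sample, toy_score))
```

Skipping blank lines matters because a line with no sentiment-bearing words contributes a neutral score, which would pull the average toward zero.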

The results of the sentiment analysis, ordered from most to least favorable, contained some points that were expected and some that were surprising. Seeing Obama rank very high in the list and Trump rank very low was expected: Wikipedia is an openly editable reference written by volunteer contributors, many of them anonymous (https://en.wikipedia.org/wiki/Wikipedia:About), and because that base of contributors is broad, it is, in my opinion, rather grassroots and populist.

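I was surprised to see Wilson rank higher than Jefferson, FDR, and others. I was very surprised to see George Washington and Abraham Lincoln rank so low. A likely factor is that VADER is lexicon-based: it scores the words in a paragraph, not the attitude toward its subject, so paragraphs about wars, assassinations, or deaths tend to score negative however admiring the article is. To put it diplomatically and optimistically, the surprises show that the sentiment analysis part of the process could benefit greatly from improvement.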
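
One small improvement that stays within VADER would be to report how many paragraphs fall on each side, rather than only the mean. The sketch below assumes the cut-offs of +0.05 and -0.05 that the VADER authors suggest for labelling a compound score as positive, neutral, or negative. The function names are hypothetical and the sample scores are made up for illustration.

```python
from collections import Counter


def label_compound(score, threshold=0.05):
    """Label a compound score using symmetric cut-offs (assumed +/-0.05)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


def label_counts(scores, threshold=0.05):
    """Count how many scores fall under each label."""
    return Counter(label_compound(s, threshold) for s in scores)


if __name__ == "__main__":
    # Made-up paragraph scores, for illustration only.
    sample_scores = [0.62, -0.40, 0.01, 0.30, -0.05]
    print(label_counts(sample_scores))
```

With counts like these, a profile whose mean is near zero could be told apart as either mostly neutral paragraphs or a mix of strongly positive and strongly negative ones.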