Friday, November 6, 2015

SCN Vanity App

This post was originally published on SCN as "SCN Vanity App".


This one goes out to my friend and colleague Aaron Williams, who gave me the idea for this -;)

For this one, we're going to use Python and some nice libraries...

Libraries
pip install pandas #For Data Analysis and Data Frames
pip install mechanize #Headless Web Browser
pip install beautifulsoup4 #For Web Scraping

Of course...we're going to use some Regular Expressions as well...but that's already included in Python :)
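
To give a feel for where the regular expressions come in: after logging in, the script pulls the user name and display name out of the page source. Here is a minimal sketch of that idea using capture groups. The `sample` text and the `extract_field` helper are made up for illustration and are not the real SCN markup.

```python
import re

# Hypothetical fragment standing in for the JavaScript found in the page source.
sample = """
    username: 'jane.doe',
    displayName: 'Jane Doe',
"""

def extract_field(name, text):
    """Return the single-quoted value that follows `name:`, or None if absent."""
    match = re.search(re.escape(name) + r":\s*'([^']*)'", text)
    return match.group(1) if match else None

print(extract_field("username", sample))
print(extract_field("displayName", sample))
```

A capture group grabs the value directly, so there is no need for a second pass to strip the label, quotes and comma.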

So, the basic idea is that we need to log in to SCN using our username and password and then read the first page of our "content" folder, filtered to blog posts only...then we can keep reading the following pages by passing a "start" parameter that loads the next 20 blogs...

Now...and before you say something -:P This works (at least for me) only for the first 10 pages...because after that the HTML seems to be automatically generated...so there's no more data to get -:( or maybe it's because my blogs go back a long time...my first blog ever on SCN, "Tasting the mix of PHP and SAP", was written on February 17, 2006...
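
The paging idea can be sketched on its own, without touching the network. The filter string is the one used in the script below. The `content_links` helper and its `max_pages` cap are assumptions added for illustration, with the cap reflecting the 10-page limit mentioned above.

```python
# Filter string taken from the script; helper name and max_pages are assumptions.
BASE = ("http://scn.sap.com/people/%s/content?filterID="
        "contentstatus[published]~objecttype~objecttype[blogpost]")
PAGE_SIZE = 20  # blogs listed per page

def content_links(author, max_pages=10):
    """Yield one link per page: the first without `start`, the rest with it."""
    for page in range(max_pages):
        link = BASE % author
        if page > 0:
            link += "&start=%d" % (page * PAGE_SIZE)
        yield link

for link in content_links("jane.doe", max_pages=3):
    print(link)
```

The first link has no offset, the second asks for `start=20`, the third for `start=40`, and so on.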

Anyway...let's see the source code -:D


SCN_Vanity_App.py
#coding= utf8

USR = 'YourUser'
PWD = 'YourPassword'

import sys
import re
import mechanize
from bs4 import BeautifulSoup  # bs4 is the module installed by beautifulsoup4
import pandas as pd

reload(sys)
sys.setdefaultencoding("iso-8859-1")

cookies = mechanize.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cookies)
br.set_handle_robots(False)

res = br.open("http://scn.sap.com/login.jspa")

br.select_form(nr=0)
br["j_username"] = USR
br["j_password"] = PWD
br.submit()

br.select_form(nr=0)
res = br.submit()

result = res.read()

author = re.search("username: \'.*",result)
author = re.sub('username: \'|\'|\,','',author.group(0))
displayname = re.search("displayName: \'.*",result)
displayname = re.sub('displayName: \'|\'|\,','',displayname.group(0))

j = 0
df = pd.DataFrame()

while True:
 try:
  link = "http://scn.sap.com/people/%s/content?filterID="\
         "contentstatus[published]~objecttype~objecttype[blogpost]" %(author)
  
  if(j>0):
   # From the second page on, ask for the next batch of 20 blogs
   link = "http://scn.sap.com/people/%s/content?filterID="\
   "contentstatus[published]~objecttype~objecttype[blogpost]&start=%s" %(author,str(j))
  
  j += 20
   
  res = br.open(link)

  Titles = []
  Likes = []
  Bookmarks = []
  Comments = []
  Views = []

  soup = BeautifulSoup(res.read()) 
  list_items = [list_item for list_item in soup.findAll('td',{'class':'j-td-title'})]
  if(len(list_items) == 0):
   break
  for i in range(0, len(list_items)):
   title = re.search('[^<>]+(?=<)',str(list_items[i]))
   Titles.append(title.group(0))

  list_items = [list_item for list_item in soup.findAll('a',{'class':'j-meta-number'})]
  for i in range(0, len(list_items), 2):
   like = re.sub('<[^>]+>|in.*','',str(list_items[i]))
   bookmark = re.sub('<[^>]+>|in.*','',str(list_items[i+1]))
   Likes.append(int(like))
   Bookmarks.append(int(bookmark))

  list_items = [list_item for list_item in soup.findAll('span',{'title':'Replies'})]
  for i in range(0, len(list_items)):
   comment = re.sub('<[^>]+>|in.*','',str(list_items[i]))
   Comments.append(int(comment))

  list_items = [list_item for list_item in soup.findAll('span',{'title':'Views'})]
  for i in range(0, len(list_items)):
   views = re.sub('<[^>]+>|in.*','',str(list_items[i]))
   Views.append(int(views))

  for i in range(0, len(Titles)):
   df = df.append({'Title': Titles[i], 'Likes': Likes[i], 'Bookmarks': Bookmarks[i], 
                   'Comments': Comments[i], 'Views': Views[i]}, ignore_index=True)
 
 except Exception:  # stop paging on any error (e.g. a page that fails to load or parse)
  break

print("Welcome " + displayname + "\n")
sum_row = df[["Views"]].sum()
print("Total number of Views" + " ==> " + str(sum_row.values[0]))
sum_row = df[["Comments"]].sum()
print("Total number of Comments" + " ==> " + str(sum_row.values[0]))
sum_row = df[["Bookmarks"]].sum()
print("Total number of Bookmarks" + " ==> " + str(sum_row.values[0]))
sum_row = df[["Likes"]].sum()
print("Total number of Likes" + " ==> " + str(sum_row.values[0]))

print("\nTop 3 Blogs with most Views")
print("---------------------------")
df = df.sort_values(by=['Views'],ascending=[False])
for i in range(0, 3):
 print(df.iloc[[i]]['Title'].values[0] + " ==> " + str(df.iloc[[i]]['Views'].values[0]))
print("\nTop 3 Blogs with most Comments")
print("---------------------------")
df = df.sort_values(by=['Comments'],ascending=[False])
for i in range(0, 3):
 print(df.iloc[[i]]['Title'].values[0] + " ==> " + str(df.iloc[[i]]['Comments'].values[0]))
print("\nTop 3 Blogs with most Bookmarks")
print("---------------------------")
df = df.sort_values(by=['Bookmarks'],ascending=[False])
for i in range(0, 3):
 print(df.iloc[[i]]['Title'].values[0] + " ==> " + str(df.iloc[[i]]['Bookmarks'].values[0]))
print("\nTop 3 Blogs with most Likes")
print("---------------------------")
df = df.sort_values(by=['Likes'],ascending=[False])
for i in range(0, 3):
 print(df.iloc[[i]]['Title'].values[0] + " ==> " + str(df.iloc[[i]]['Likes'].values[0]))
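
The reporting part can also be tried out on its own. This is a rough sketch that swaps pandas for plain Python lists and `sorted()`, so it runs without any extra libraries. The `blogs` rows are made-up numbers, and `total` and `top` are hypothetical helper names.

```python
# Made-up rows standing in for what the scraper collects; not real SCN numbers.
blogs = [
    {"Title": "Post A", "Views": 1200, "Comments": 4, "Bookmarks": 2, "Likes": 7},
    {"Title": "Post B", "Views": 300, "Comments": 9, "Bookmarks": 0, "Likes": 1},
    {"Title": "Post C", "Views": 4500, "Comments": 1, "Bookmarks": 5, "Likes": 3},
    {"Title": "Post D", "Views": 800, "Comments": 6, "Bookmarks": 1, "Likes": 9},
]
METRICS = ["Views", "Comments", "Bookmarks", "Likes"]

def total(rows, metric):
    """Sum one metric across all blogs."""
    return sum(row[metric] for row in rows)

def top(rows, metric, n=3):
    """Return up to n (title, value) pairs, highest value first."""
    ranked = sorted(rows, key=lambda row: row[metric], reverse=True)
    return [(row["Title"], row[metric]) for row in ranked[:n]]

for metric in METRICS:
    print("Total number of %s ==> %d" % (metric, total(blogs, metric)))

for metric in METRICS:
    print("\nTop 3 Blogs with most %s" % metric)
    print("---------------------------")
    for title, value in top(blogs, metric):
        print("%s ==> %d" % (title, value))
```

Because `top` slices the sorted list, it also works when there are fewer than three blogs, and looping over `METRICS` keeps each heading in sync with the column being ranked.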


If we run this code, then we're going to have a nice report like this one -;)


Of course...it would look better with a nicer UI...but that's not my forte -:(  So...if anyone wants to pick up the project and improve it...I would really appreciate it -;)

Greetings,

Blag.
Development Culture.
