The analysis is part of an assignment for a MOOC course, that is not referenced here due to course policy. The sources of the data used in the analysis is given as follows:

  • List of indicators of energy supply and renewable electricity production for the year 2013 can be found as an excel file at the United Nations Statistics Division’s website linked here. The energy data from the excel file Energy Indicators.xls, which is read into a DataFrame with the variable name of energy.

  • The data containing GDP of the countries from 1960 to 2015 can be found at the World Bank and is stored in the csv file world_bank.csv, and then read into the DataFrame GDP.

  • The ranking of countries based on their journal contributions for Energy Engineering and Power Technology can be found at Sciamgo Journal stored in the file scimagojr-3.xlsx, and then loaded into the DataFrame ScimEn.

The above three datasets: GDP, Energy, and ScimEn are merged into a new dataset (using the intersection of country names) using only the last 10 years (2006-2015) of GDP data and only the top 15 countries by Scimagojr ‘Rank’ (Rank 1 through 15).

The index of this DataFrame is the name of the country, and the columns are

['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations',
       'Citations per document', 'H index', 'Energy Supply',
       'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008',
       '2009', '2010', '2011', '2012', '2013', '2014', '2015'].

Functions answering specific queries are written below.

import pandas as pd
import numpy as np

This function reads the above three datasets, cleans them and joins them, then returns a DataFrame with 20 columns and top 15 entries sorted by the rank given by Scimago journal.

def get_data():
    def get_energy():
        energy = pd.read_excel('Energy Indicators.xls', skiprows = 18, 
                               skip_footer = 38, header = None)
        energy.drop(energy.columns[[0,2]], axis = 1, inplace = True)
        energy.columns = ['Country', 'Energy Supply', 
                          'Energy Supply per Capita', '% Renewable']
        energy = energy.replace('...', np.nan)
        energy['Energy Supply'] *= 1000000
        energy.set_index('Country', inplace = True)
        energy.rename(index = { "Republic of Korea": "South Korea",
        "United States of America": "United States",
        "United Kingdom of Great Britain and Northern Ireland": "United Kingdom",
        "China, Hong Kong Special Administrative Region": "Hong Kong"}, 
        inplace = True)
        dictnry = {ord("1"): None, ord("1"): None, ord("2"): None, 
                   ord("3"): None, ord("4"): None, ord("5"): None, 
                   ord("6"): None, ord("7"): None, ord("8"): None, 
                   ord("9"): None}
        index_list_energy = {c : c.split('(')[0].strip().translate(dictnry) 
                             for c in energy.index}
        energy.rename(index = index_list_energy, inplace = True)
        return energy
    def get_GDP():
        GDP = pd.read_csv('world_bank.csv', skiprows = 4)
        GDP['Country Name'].replace({"Korea, Rep.": "South Korea", 
                                    "Iran, Islamic Rep.": "Iran",
                                    "Hong Kong SAR, China": "Hong Kong"
                                    }, inplace = True)
        GDP.set_index('Country Name', inplace = True) = 'Country'
        columns_to_keep = GDP.columns[-10:]
        GDP = GDP[columns_to_keep]
        return GDP
    def get_ScimEn():
        ScimEn = pd.read_excel("scimagojr-3.xlsx")
        ScimEn.set_index('Country', inplace = True)
        return ScimEn
    energy, GDP, ScimEn = get_energy(), get_GDP(), get_ScimEn()
    ScimEn = ScimEn[ScimEn['Rank'] <= 15]
    d = pd.merge(pd.merge(ScimEn, energy, how = "inner", left_index = True, 
                right_index = True), GDP, how = "inner", left_index = True, 
                right_index = True)
    d = d.sort_values(by = "Rank")
    return d
print("Columns:\n", get_data().columns)
print("Indices:\n", get_data().index)
 Index(['Rank', 'Documents', 'Citable documents', 'Citations', 'Self-citations',
       'Citations per document', 'H index', 'Energy Supply',
       'Energy Supply per Capita', '% Renewable', '2006', '2007', '2008',
       '2009', '2010', '2011', '2012', '2013', '2014', '2015'],
 Index(['China', 'United States', 'Japan', 'United Kingdom',
       'Russian Federation', 'Canada', 'Germany', 'India', 'France',
       'South Korea', 'Italy', 'Spain', 'Iran', 'Australia', 'Brazil'],
      dtype='object', name='Country')

From now onwards, only the top 15 countries by Scimagojr Rank (aka the DataFrame returned by get_data()) is used for analysis.

This function returns a Series named avgGDP giving the average GDP over the last 10 years for the 15 countries sorted in descending order.(Missing values are excluded from this calculation.)

def get_avgGDP(): 
    import pandas as pd
    Top15 = get_data()
    def avgGDP(row):
        data = row[-10:]
        return pd.Series({'avgGDP': data.mean()}) 
    avgGDP_series = Top15.apply(avgGDP, axis = 1).sort_values(by = "avgGDP",
                    ascending = False)['avgGDP']
    return avgGDP_series
United States         1.536434e+13
China                 6.348609e+12
Japan                 5.542208e+12
Germany               3.493025e+12
France                2.681725e+12
United Kingdom        2.487907e+12
Brazil                2.189794e+12
Italy                 2.120175e+12
India                 1.769297e+12
Canada                1.660647e+12
Russian Federation    1.565459e+12
Spain                 1.418078e+12
Australia             1.164043e+12
South Korea           1.106715e+12
Iran                  4.441558e+11
Name: avgGDP, dtype: float64

This function returns the change in GDP over the 10 year span for the country with the 6th largest average GDP.

def change_in_GDP():
    Top15 = get_data()
    avgGDP = get_avgGDP()
    idx = avgGDP.index[5]
    return Top15.loc[idx]['2015'] - Top15.loc[idx]['2006']

This function returns the mean Energy Supply per Capita.

def mean_EnergySupply():
    Top15 = get_data()
    return Top15['Energy Supply per Capita'].mean()

This function returns a tuple with the name of the country with the maximum % Renewable and the percentage.

def max_Renewable():
    Top15 = get_data()
    return (Top15['% Renewable'].idxmax(), Top15['% Renewable'].max())
('Brazil', 69.648030000000006)

This function returns a tuple with the name of the country that the maximum value for the ratio of Self-Citations to Total Citations and the ratio.

def ratio_Citations():
    Top15 = get_data()
    Top15["Ratio"] = Top15["Self-citations"]/Top15["Citations"]
    return (Top15["Ratio"].idxmax(), Top15["Ratio"].max())
('China', 0.68931261793894216)

This function returns the third most populous country according to the estimate of the population using Energy Supply and Energy Supply per capita.

def thirdPopulous_EnergySpply():
    Top15 = get_data()
    Top15['PopEst'] = Top15['Energy Supply'] / Top15['Energy Supply per Capita']
    Top15.sort_values(by = "PopEst", inplace = True, ascending = False)
    return Top15.index[2]
'United States'

This function returns the correlation between the number of citable documents per capita and the energy supply per capita, using the .corr() method (Pearson’s correlation).

def corr_Citations_Energy():
    Top15 = get_data()
    Top15['PopEst']=Top15['Energy Supply']/Top15['Energy Supply per Capita']
    Top15["Citable documents per person"]=Top15["Citable documents"]/Top15['PopEst']
    c = Top15[["Citable documents per person","Energy Supply per Capita"]].corr()
    return c.iloc[0,1]

This function returns a series named HighRenew that contains only the countries with % Renewable value at or above the median for all countries in the top 15. The index of this series is the country name sorted in ascending order of rank.

def get_HighRenew():
    Top15 = get_data()
    m = Top15["% Renewable"].median()
    Top15.sort_values(by = "Rank", inplace = True)
    Top15["HighRenew_TruthValue"] = (Top15["% Renewable"] >= m).astype(int)
    HighRenew = Top15[Top15["HighRenew_TruthValue"] == 1].loc[:,"% Renewable"]
    return HighRenew
China                 19.75491
Russian Federation    17.28868
Canada                61.94543
Germany               17.90153
France                17.02028
Italy                 33.66723
Spain                 37.96859
Brazil                69.64803
Name: % Renewable, dtype: float64

This function groups the countries by Continent and returns the dataframe with columns corresponding to the number of countries in each continent bin, and the sum, mean, and std deviation for the estimated population of each country.

def get_ContinentData():
    Top15 = get_data() 
    Top15['PopEst'] = Top15['Energy Supply']/Top15['Energy Supply per Capita']
    ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'South Korea':'Asia', 
                  'Brazil':'South America'}
    return Top15.groupby(by = ContinentDict)["PopEst"].agg([np.size, np.sum, 
                                                            np.mean, np.std])
size sum mean std
Asia 5.0 2.898666e+09 5.797333e+08 6.790979e+08
Australia 1.0 2.331602e+07 2.331602e+07 NaN
Europe 6.0 4.579297e+08 7.632161e+07 3.464767e+07
North America 2.0 3.528552e+08 1.764276e+08 1.996696e+08
South America 1.0 2.059153e+08 2.059153e+08 NaN

This function cuts % Renewable into 5 bins, group Top15 by the Continent, as well as these new % Renewable bins and finally returns a Series giving the number of countries in each of these groups with a MultiIndex of Continent and the bins for % Renewable. (Groups with no countries are not included in the returned series)

def bins_Renew():
    Top15 = get_data()
    bins = pd.cut(Top15["% Renewable"], 5)
    ContinentDict  = {'China':'Asia', 
                  'United States':'North America', 
                  'United Kingdom':'Europe', 
                  'Russian Federation':'Europe', 
                  'Canada':'North America', 
                  'South Korea':'Asia', 
                  'Brazil':'South America'}    
    return pd.Series(Top15.groupby(by = [ContinentDict, bins]).size())
               % Renewable     
Asia           (2.212, 15.753]     4
               (15.753, 29.227]    1
Australia      (2.212, 15.753]     1
Europe         (2.212, 15.753]     1
               (15.753, 29.227]    3
               (29.227, 42.701]    2
North America  (2.212, 15.753]     1
               (56.174, 69.648]    1
South America  (56.174, 69.648]    1
dtype: int64

This function returns the Population Estimate series with values converted to a string with thousands separator (using commas)

e.g. 317615384.61538464 -> 317,615,384.61538464

def format_PopEst():
    Top15 = get_data()
    Top15['PopEst']=Top15['Energy Supply']/Top15['Energy Supply per Capita']
    Top15 = Top15['PopEst']
    return Top15.apply(lambda x: "{:,}".format(x))
China                 1,367,645,161.2903225
United States          317,615,384.61538464
Japan                  127,409,395.97315437
United Kingdom         63,870,967.741935484
Russian Federation            143,500,000.0
Canada                  35,239,864.86486486
Germany                 80,369,696.96969697
India                 1,276,730,769.2307692
France                  63,837,349.39759036
South Korea            49,805,429.864253394
Italy                  59,908,256.880733944
Spain                    46,443,396.2264151
Iran                    77,075,630.25210084
Australia              23,316,017.316017315
Brazil                 205,915,254.23728815
Name: PopEst, dtype: object