You can always refer to this LinkedIn Learning Video Using Pythion with Excel.

Install Python Interpreter

Interpreter in computer are necessary in every programming, like it’s name, it “translate“ our human readable code into 0 and 1 computer readable “language“.

  1. Download the newest Python interpreter from python.org.

  2. Run .exe file, use default download options.

  3. You need a Python IDE (Integrated Development Environment) or a code editor to write code. Recommended using IDE.

    • Windows OS: JetBrains PyCharm, VS Code
    • MacOS: JetBrains PyCharm, VS Code, Komodo, Atom

Python “Pandas” Basic

Pandas library - primarily used in data science, provide Python function, statement that make Excel sorting, examining and modifying very easy.

You can always refer to Pandas Official API Document, and Pandas Official User Guide.

Install Pandas

  1. Open command prompt:
  2. Install Python Pandas: in command prompt type py -m pip install pandas
  3. Import Pandas: in command prompt py, then import pandas

Import “Pandas”

Import Pandas library that we can use it’s function and statement. For example the one we used next step pd.read_csv.

1
2
import pandas as pd
from openpyx1.workbook import Workbook

Loading and saving data

Lode excel and txt data into Python data frame to be able to work on it.

1
2
3
df_excel = pd.read_excel('regions.xlsx')			# upload Excel file; 
df_csv = pd.read_csv('Names.csv', header=None) # upload csv file; Use "header=None" syntax if the datasheet has no header
df_txt = pd.read_csv('data.txt', delimiter='\t') # upload text file; Use "delimiter='\t" syntax when loading txt file
  • df short for Data Frame
  • Use .read_csv for both .csv and .txt file, because .csv and .txt are very similar in terms how they column their data.

Save as the modified file into a new Excel file. The saving step will be omitted in subsequent presentations.

1
df_csv.to.excel('modified.xlsx')

Python “Pandas” in practices

Add header for data sheet

1
2
df_csv = pd.read_csv('Names.csv', header=None)	# load csv file
df_csv.columns = ['First','Last','Address','City','State','Postcode'] # defien the name for each columns

addHeader

View number of column selectively

1
2
3
4
5
6
7
8
df = pd.read_csv('Names.csv', header=None)	# load csv file
df_csv.columns = ['First','Last','Address','City','State','Postcode']
wanted_values = df[['First','Last','Post Code']]

# or
df = pd.read_csv('Names.csv', header=None) # load csv file
df_csv.columns = ['First','Last','Address','City','State','Postcode']
df.drop = ['Address','City','State']
  • wanted_values - select the columns want to show
  • df.drop - select the columns don’t want to show

addHeader

Filter data

This case we want to filter the data with First name Stephen who live in SomeTown.

1
2
3
df = pd.read_csv('Names.csv', header=None)	# load csv file
df_csv.columns = ['First','Last','Address','City','State','Postcode','Income']
df_filter = df.loc[(df['City'] == 'SomeTown') & (df['First'] == 'Stephen')]

Filter judgment has tow part

  • (df['City'] == 'SomeTown') - find all ‘SomeTown’ in ‘City’ column

  • (df['First'] == 'Stephen') - find all ‘Stephen’ in ‘First’ column

filter

Apply Logical expression on data

Apply filter using one line of code of if()...else... function, and store output in a new named column.

To do the same thing, Microsoft Excel requre more effort, time and understand.

You need select “Sort and filter” –> “customer sort” –> select data range –> select output range –> define logical filter …

1
2
3
df = pd.read_csv('Names.csv', header=None)	# load csv file
df_csv.columns = ['First','Last','Income']
df['Tax %'] = df['Income'].apply(lambda x: .15 if 10000<x<40000 else .2 40001<x<80000 else .25)
  • df[‘Tax %’]` - create a new column call ‘Tax %’, and store calculate value for each people/unit.

  • df['Income'].apply(...) - with ‘Income’, apply following function …

    • lambda x - x represent different ‘Income’ for each people/unit

    • .15 if 10000<x<40000 - output 15% if ‘Income’ is between 10000 ~ 40000

    • else .2 40000<x<80000 - output 20% if ‘Income’ is between 40001 ~ 80000

    • else .25 - output 25% beyond 80000

      logicalEx

Calculate data

Simple calculation, one column multiple other column and create a new column to save the output.

1
2
3
df = pd.read_csv('Names.csv', header=None)	# load csv file
df_csv.columns = ['First','Last','Income']
df['Taxes Owed'] = df['Income'] * df['Tax %']

calculate

Simple Classification data analysis

We classification/grouping according to “Test column”, which is “True” and “False”, and calculate the mean value for “Income”, “Tax %” and “Taxes Owned”.

1
2
3
4
df = pd.read_csv('Names.csv', header=None)	# load csv file
df_csv.columns = ['First','Last','Income','Tax %','Test Column']
df.drop = ['First','Last']
df.groupby(['Test Column']).mean()

classAna