"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Outline\n",
" \n",
" 1) Handling Large Datasets with Pandas\n",
" 2) Data Fitting with SciPy\n",
" 3) Interpolation with SciPy\n",
" 4) Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Handling Large Datasets with Pandas\n",
"Thus far we have only covered very simplistic cases where all the data can be stored as individual values or multi-dimensional arrays. However, sometimes datasets can be much larger and require more complex handling structures. This is exactly where the Pandas library comes into play.\n",
"\n",
"Pandas is an incredible library capable of doing almost everything needed for large dataset analysis. For documentation or instructions on how to install the library go to their website: https://pandas.pydata.org/\n",
"In general, there are two types of Pandas objects:\n",
"\n",
"* Series - A one-dimensional data structure that can store values, and for every value it holds a unique index, too (sounds familiar?).\n",
"* DataFrame - A two (or more) dimensional data structure. Effectively a table with rows and columns. The columns have names and the rows have indexes.\n",
"\n",
"Since Series are fancier arrays, I will focus only on DataFrames in this lecture. First, as always, you need to remember to add the library into your code, like this:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading and Printing Data Files\n",
"Pandas can take almost any data files or data list as input and convert it into a DataFrame that is easy to use (whether you are working with text files, CSV files, SQL files, and others). To import data you only need a simple line of code using a function called read_csv(). Don't let the name fool you, the function reads all the data files listed above. The read_csv() function takes several inputs, here the essentials: \n",
"\n",
"* File Path & Name: Path to the file and file name, so that Pandas can access the data (NB: Always provide the full directory if you can).\n",
"* Sep: This input parameter tells Pandas how the data is formatted inside the file (tab-edited, comma-separated, etc).\n",
"* Headers: This lets you pass an array with the name of each row. I recommend setting this to 0 (header=0).\n",
"* Names: Similarly to Headers it lets you re-name each column. I suggest to always label your columns to make your life easy later. \n",
"\n",
"For instance, here is an example by importing a text file created from the ESTAR dataset (Stopping Powers and Range Tables for Electrons). Note that the file is already loaded in the course directory."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"file = 'Lecture3_DataFile.txt' ## Path + Name of the Data File\n",
"names = ['Energy','Collision SP','Radiative SP','Range'] ## Array containing the columns names\n",
"DataFrame = pd.read_csv(file, header=None, sep='\\t', names=names) ## Import file to DataFrame"
]
},
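{
"cell_type": "markdown",
"metadata": {},
"source": [
"The way `sep`, `header` and `names` work together is easier to see on a tiny example. The sketch below is only an illustration, not part of the ESTAR workflow: it reads a made-up comma-separated table from an in-memory string (through `io.StringIO`) instead of a file on disk, and the names `raw_text` and `demo_df` are hypothetical.\n",
"\n",
"```python\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Made-up comma-separated text with no header line (values are illustrative only)\n",
"raw_text = '''0.01,51.24\n",
"0.02,29.16\n",
"0.03,21.00\n",
"'''\n",
"\n",
"# header=None: the first line is data, so the column names come from names\n",
"demo_df = pd.read_csv(io.StringIO(raw_text), sep=',', header=None, names=['Energy', 'Collision SP'])\n",
"print(demo_df.shape)          # (number of rows, number of columns)\n",
"print(list(demo_df.columns))  # column labels supplied through names\n",
"```\n",
"\n",
"If the text started with a line of column names, `header=0` would read the names from that line and `names` could be left out."
]
},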
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can always display your DataFrames using the following two methods:\n",
"\n",
"* print(DataFrame) - Prints the Pandas data frame according to screen space.\n",
"* DataFrame_Name - Prints the full data frame with Pandas graphics."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Energy Collision SP Radiative SP Range\n",
"0 0.0100 51.240 0.000970 0.000108\n",
"1 0.0125 42.710 0.000979 0.000161\n",
"2 0.0150 36.810 0.000988 0.000225\n",
"3 0.0175 32.490 0.000996 0.000297\n",
"4 0.0200 29.160 0.001004 0.000378\n",
".. ... ... ... ...\n",
"76 600.0000 5.553 8.821000 68.480000\n",
"77 700.0000 5.577 10.360000 75.080000\n",
"78 800.0000 5.597 11.910000 81.070000\n",
"79 900.0000 5.616 13.460000 86.540000\n",
"80 1000.0000 5.632 15.020000 91.570000\n",
"\n",
"[81 rows x 4 columns]\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Energy
\n",
"
Collision SP
\n",
"
Radiative SP
\n",
"
Range
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
0.0100
\n",
"
51.240
\n",
"
0.000970
\n",
"
0.000108
\n",
"
\n",
"
\n",
"
1
\n",
"
0.0125
\n",
"
42.710
\n",
"
0.000979
\n",
"
0.000161
\n",
"
\n",
"
\n",
"
2
\n",
"
0.0150
\n",
"
36.810
\n",
"
0.000988
\n",
"
0.000225
\n",
"
\n",
"
\n",
"
3
\n",
"
0.0175
\n",
"
32.490
\n",
"
0.000996
\n",
"
0.000297
\n",
"
\n",
"
\n",
"
4
\n",
"
0.0200
\n",
"
29.160
\n",
"
0.001004
\n",
"
0.000378
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
76
\n",
"
600.0000
\n",
"
5.553
\n",
"
8.821000
\n",
"
68.480000
\n",
"
\n",
"
\n",
"
77
\n",
"
700.0000
\n",
"
5.577
\n",
"
10.360000
\n",
"
75.080000
\n",
"
\n",
"
\n",
"
78
\n",
"
800.0000
\n",
"
5.597
\n",
"
11.910000
\n",
"
81.070000
\n",
"
\n",
"
\n",
"
79
\n",
"
900.0000
\n",
"
5.616
\n",
"
13.460000
\n",
"
86.540000
\n",
"
\n",
"
\n",
"
80
\n",
"
1000.0000
\n",
"
5.632
\n",
"
15.020000
\n",
"
91.570000
\n",
"
\n",
" \n",
"
\n",
"
81 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Energy Collision SP Radiative SP Range\n",
"0 0.0100 51.240 0.000970 0.000108\n",
"1 0.0125 42.710 0.000979 0.000161\n",
"2 0.0150 36.810 0.000988 0.000225\n",
"3 0.0175 32.490 0.000996 0.000297\n",
"4 0.0200 29.160 0.001004 0.000378\n",
".. ... ... ... ...\n",
"76 600.0000 5.553 8.821000 68.480000\n",
"77 700.0000 5.577 10.360000 75.080000\n",
"78 800.0000 5.597 11.910000 81.070000\n",
"79 900.0000 5.616 13.460000 86.540000\n",
"80 1000.0000 5.632 15.020000 91.570000\n",
"\n",
"[81 rows x 4 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(DataFrame) ## Standard Python printing method\n",
"DataFrame ## Uses Pandas graphics for data frame display"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes, it’s good not to print the whole DataFrame and fill your screen with numbers and lines. If you simply want to check that some operation worked, you can use the following commands:\n",
"\n",
"* DataFrame.head() - Just prints the first 5 elements, with index and names.\n",
"* DataFrame.tail() - Just prints the last 5 elements, with index and names.\n",
"* DataFrame.sample(n) - Prints n elements, randomly picked from the DataFrame index. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" Energy Collision SP Radiative SP Range\n",
"36 2.00 3.823 0.01162 0.474400\n",
"15 0.09 9.367 0.00119 0.005544\n",
"69 250.00 5.417 3.49600 37.850000\n",
"58 50.00 5.090 0.59590 10.150000\n",
"63 90.00 5.238 1.15300 16.770000"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataFrame.sample(5) ## Print 5 lements, randomly selected"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Final note, before moving on to Data Frame handling, you can always access any given value in the Data Frame by simply specifying the name of the column and the index number (similar to multi-D arrays). "
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Entry 74 in Energy: 450.0\n"
]
}
],
"source": [
"print('Entry 74 in Energy:',DataFrame['Energy'][73]) ## Print the 74th element in the DataFrame"
]
},
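{
"cell_type": "markdown",
"metadata": {},
"source": [
"For single values, Pandas also provides the label-based `.loc` and position-based `.iloc` accessors, which do the lookup in one step instead of chaining `[column][index]`. Here is a minimal sketch on a made-up frame; `toy_df` and its values are assumptions for illustration, and the index labels are deliberately different from the row positions.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up frame for illustration; the index labels are 10, 11, 12 on purpose\n",
"toy_df = pd.DataFrame({'Energy': [1.0, 2.0, 3.0], 'Range': [0.2, 0.5, 0.7]}, index=[10, 11, 12])\n",
"\n",
"by_label = toy_df.loc[11, 'Energy']   # row with index label 11, column Energy\n",
"by_position = toy_df.iloc[1, 0]       # second row, first column (positions start at 0)\n",
"print(by_label, by_position)\n",
"```\n",
"\n",
"Both lookups point at the same cell here. The difference matters once a Data Frame has been filtered or sorted, because the index labels then no longer match the row positions."
]
},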
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Frame Handling\n",
"First off, you can always create new Data Frames based on an already existing Data Frame. This is particularly handy if you need to edit certain variables (or columns) for a given analysis and you don't want to modify the original data. This can be achieved by slicing the Data Frame like this:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Range
\n",
"
Energy
\n",
"
\n",
" \n",
" \n",
"
\n",
"
76
\n",
"
68.48
\n",
"
600.0
\n",
"
\n",
"
\n",
"
77
\n",
"
75.08
\n",
"
700.0
\n",
"
\n",
"
\n",
"
78
\n",
"
81.07
\n",
"
800.0
\n",
"
\n",
"
\n",
"
79
\n",
"
86.54
\n",
"
900.0
\n",
"
\n",
"
\n",
"
80
\n",
"
91.57
\n",
"
1000.0
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Range Energy\n",
"76 68.48 600.0\n",
"77 75.08 700.0\n",
"78 81.07 800.0\n",
"79 86.54 900.0\n",
"80 91.57 1000.0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"NewDataFrame = DataFrame[['Range','Energy']] ## New df based on a subset of DataFrame\n",
"NewDataFrame.tail() ## Print last 5 elements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note the double brackets. The outer bracket frames tell pandas that you want to select columns, and the inner brackets are for the list of the column names. This slicing can be done with any number of columns. \n",
"\n",
"Now, say that instead of selecting given columns, you are interested only in an object inside the Data Frame that meets a certain requirement. Here how to skim through the Data Frame quickly, while searching for conditionals. For example, imagine if you want to select only energy ranges that are < than 5 MeV."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Energy
\n",
"
Collision SP
\n",
"
Radiative SP
\n",
"
Range
\n",
"
\n",
" \n",
" \n",
"
\n",
"
37
\n",
"
2.5
\n",
"
3.873
\n",
"
0.01534
\n",
"
0.6039
\n",
"
\n",
"
\n",
"
38
\n",
"
3.0
\n",
"
3.924
\n",
"
0.01931
\n",
"
0.7316
\n",
"
\n",
"
\n",
"
39
\n",
"
3.5
\n",
"
3.973
\n",
"
0.02348
\n",
"
0.8576
\n",
"
\n",
"
\n",
"
40
\n",
"
4.0
\n",
"
4.020
\n",
"
0.02782
\n",
"
0.9819
\n",
"
\n",
"
\n",
"
41
\n",
"
4.5
\n",
"
4.063
\n",
"
0.03230
\n",
"
1.1050
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Energy Collision SP Radiative SP Range\n",
"37 2.5 3.873 0.01534 0.6039\n",
"38 3.0 3.924 0.01931 0.7316\n",
"39 3.5 3.973 0.02348 0.8576\n",
"40 4.0 4.020 0.02782 0.9819\n",
"41 4.5 4.063 0.03230 1.1050"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"LowEnergyDF = DataFrame[DataFrame.Energy < 5.0] ## Vreate New DataFrame filled with entries of Energy < 5.0 MeV\n",
"LowEnergyDF.tail() ## Print the last 5 elements"
]
},
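{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conditions can also be combined. The sketch below, on a made-up frame (`toy_df` and `window_df` are hypothetical names), keeps only the rows inside an energy window. Each condition goes in its own parentheses, and the element-wise operators `&` (and) and `|` (or) are used instead of the Python keywords.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up values for illustration\n",
"toy_df = pd.DataFrame({'Energy': [0.5, 2.0, 4.0, 8.0], 'Range': [0.1, 0.5, 1.0, 2.1]})\n",
"\n",
"# Keep rows with 1.0 < Energy < 5.0; the parentheses around each condition are required\n",
"window_df = toy_df[(toy_df['Energy'] > 1.0) & (toy_df['Energy'] < 5.0)]\n",
"print(window_df)\n",
"```\n",
"\n",
"As with the single condition above, the selected rows keep their original index labels."
]
},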
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas have a series of very simple one-line-commands that will let do basic operation on an individual or multiple columns at the same time. Here the command list:\n",
"\n",
"* count() - Returns the number of rows in each columns \n",
"* sum() - Returns the sum of all entries in a given column(s). (NB: Thinks get funky if you use this for non-numbers).\n",
"* min() - Returns the smallest value in the selected column(s).\n",
"* max() - Returns the maximum value in a given column(s).\n",
"* mean() - Returns the mean of the selected column(s).\n",
"\n",
"NB: all commands can be used for a series of columns or individually."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of Entries in Energy: 81\n",
"Sum of All Ranges: 1044.3855251999998\n",
"Minimum Collision Stopping Power: 3.787\n",
"Minimum Radiative Stopping Power: 15.02\n",
"Minimum Collision Stopping Power: 12.893648459259257\n"
]
}
],
"source": [
"print('Number of Entries in Energy:',DataFrame.Energy.count()) ## Print the entries in Energy\n",
"print('Sum of All Ranges:',DataFrame['Range'].sum()) ## Print sum of all entries in Range\n",
"print('Minimum Collision Stopping Power:',DataFrame['Collision SP'].min()) ## Print the minimum value in Collision Stopping Power\n",
"print('Minimum Radiative Stopping Power:',DataFrame['Radiative SP'].max()) ## Print the minimum value in Radiative Stopping Power\n",
"print('Minimum Collision Stopping Power:',DataFrame.Range.mean()) ## Print the minimum value in Collision Stopping Power"
]
},
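{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the note about several columns at once, here is a small sketch on a made-up frame (`toy_df`, `col_means` and `summary` are hypothetical names). Calling a command on a list of columns returns one result per column, and `agg()` collects several statistics in one table.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up values for illustration\n",
"toy_df = pd.DataFrame({'X': [1.0, 2.0, 3.0], 'Y': [10.0, 20.0, 60.0]})\n",
"\n",
"col_means = toy_df[['X', 'Y']].mean()                       # one mean per column\n",
"summary = toy_df[['X', 'Y']].agg(['min', 'max', 'mean'])   # one row per statistic\n",
"print(col_means)\n",
"print(summary)\n",
"```\n",
"\n",
"Individual numbers can then be read back with `.loc`, for example `summary.loc['max', 'Y']`."
]
},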
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Basic operations between columns can also be executed and stored in a new column that gets added at the end of the Data Frame. Any kind of Python native operation can be done here (+, -, *, %, etc)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Energy
\n",
"
Collision SP
\n",
"
Radiative SP
\n",
"
Range
\n",
"
RangeTimesColl
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
0.0100
\n",
"
51.24
\n",
"
0.000970
\n",
"
0.000108
\n",
"
0.005513
\n",
"
\n",
"
\n",
"
1
\n",
"
0.0125
\n",
"
42.71
\n",
"
0.000979
\n",
"
0.000161
\n",
"
0.006889
\n",
"
\n",
"
\n",
"
2
\n",
"
0.0150
\n",
"
36.81
\n",
"
0.000988
\n",
"
0.000225
\n",
"
0.008264
\n",
"
\n",
"
\n",
"
3
\n",
"
0.0175
\n",
"
32.49
\n",
"
0.000996
\n",
"
0.000297
\n",
"
0.009650
\n",
"
\n",
"
\n",
"
4
\n",
"
0.0200
\n",
"
29.16
\n",
"
0.001004
\n",
"
0.000378
\n",
"
0.011031
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Energy Collision SP Radiative SP Range RangeTimesColl\n",
"0 0.0100 51.24 0.000970 0.000108 0.005513\n",
"1 0.0125 42.71 0.000979 0.000161 0.006889\n",
"2 0.0150 36.81 0.000988 0.000225 0.008264\n",
"3 0.0175 32.49 0.000996 0.000297 0.009650\n",
"4 0.0200 29.16 0.001004 0.000378 0.011031"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataFrame['RangeTimesColl'] = DataFrame['Range']*DataFrame['Collision SP'] ## Create new column \n",
"DataFrame.head() ## Print the top of the Data Frame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Frame Sorting\n",
"One of the most useful functionalities of Pandas is the ability to sort any dataset based on a given category (or column). You can even make a combination of sorting (based on 2 or more columns). The command to sort a Data Frame is sort_values(). Note that the default is to sort in descending (or alphabetical) order, but descending order can also be requested (ascending = False)."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" Energy Collision SP Radiative SP Range RangeTimesColl\n",
"0 0.0100 51.240 0.000970 0.000108 0.005513\n",
"1 0.0125 42.710 0.000979 0.000161 0.006889\n",
"2 0.0150 36.810 0.000988 0.000225 0.008264\n",
"3 0.0175 32.490 0.000996 0.000297 0.009650\n",
"4 0.0200 29.160 0.001004 0.000378 0.011031\n",
".. ... ... ... ... ...\n",
"36 2.0000 3.823 0.011620 0.474400 1.813631\n",
"32 1.0000 3.815 0.005152 0.211700 0.807635\n",
"35 1.7500 3.802 0.009862 0.409000 1.555018\n",
"34 1.5000 3.788 0.008190 0.343300 1.300420\n",
"33 1.2500 3.787 0.006614 0.277400 1.050514\n",
"\n",
"[81 rows x 5 columns]"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataFrame.sort_values(by=['Collision SP'], ascending = False) ## Sort the Data Frame based on Collision SP"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see from the example above, even after sorting the index is kept as the original ordering. If you want to reset the index to match the new order you need the function .reset_index(). Note, that this will automatically generate a new column, placed at the beginning, containing the original index."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
index
\n",
"
Energy
\n",
"
Collision SP
\n",
"
Radiative SP
\n",
"
Range
\n",
"
RangeTimesColl
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
33
\n",
"
1.2500
\n",
"
3.787
\n",
"
0.006614
\n",
"
0.277400
\n",
"
1.050514
\n",
"
\n",
"
\n",
"
1
\n",
"
34
\n",
"
1.5000
\n",
"
3.788
\n",
"
0.008190
\n",
"
0.343300
\n",
"
1.300420
\n",
"
\n",
"
\n",
"
2
\n",
"
35
\n",
"
1.7500
\n",
"
3.802
\n",
"
0.009862
\n",
"
0.409000
\n",
"
1.555018
\n",
"
\n",
"
\n",
"
3
\n",
"
32
\n",
"
1.0000
\n",
"
3.815
\n",
"
0.005152
\n",
"
0.211700
\n",
"
0.807635
\n",
"
\n",
"
\n",
"
4
\n",
"
36
\n",
"
2.0000
\n",
"
3.823
\n",
"
0.011620
\n",
"
0.474400
\n",
"
1.813631
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
76
\n",
"
4
\n",
"
0.0200
\n",
"
29.160
\n",
"
0.001004
\n",
"
0.000378
\n",
"
0.011031
\n",
"
\n",
"
\n",
"
77
\n",
"
3
\n",
"
0.0175
\n",
"
32.490
\n",
"
0.000996
\n",
"
0.000297
\n",
"
0.009650
\n",
"
\n",
"
\n",
"
78
\n",
"
2
\n",
"
0.0150
\n",
"
36.810
\n",
"
0.000988
\n",
"
0.000225
\n",
"
0.008264
\n",
"
\n",
"
\n",
"
79
\n",
"
1
\n",
"
0.0125
\n",
"
42.710
\n",
"
0.000979
\n",
"
0.000161
\n",
"
0.006889
\n",
"
\n",
"
\n",
"
80
\n",
"
0
\n",
"
0.0100
\n",
"
51.240
\n",
"
0.000970
\n",
"
0.000108
\n",
"
0.005513
\n",
"
\n",
" \n",
"
\n",
"
81 rows × 6 columns
\n",
"
"
],
"text/plain": [
" index Energy Collision SP Radiative SP Range RangeTimesColl\n",
"0 33 1.2500 3.787 0.006614 0.277400 1.050514\n",
"1 34 1.5000 3.788 0.008190 0.343300 1.300420\n",
"2 35 1.7500 3.802 0.009862 0.409000 1.555018\n",
"3 32 1.0000 3.815 0.005152 0.211700 0.807635\n",
"4 36 2.0000 3.823 0.011620 0.474400 1.813631\n",
".. ... ... ... ... ... ...\n",
"76 4 0.0200 29.160 0.001004 0.000378 0.011031\n",
"77 3 0.0175 32.490 0.000996 0.000297 0.009650\n",
"78 2 0.0150 36.810 0.000988 0.000225 0.008264\n",
"79 1 0.0125 42.710 0.000979 0.000161 0.006889\n",
"80 0 0.0100 51.240 0.000970 0.000108 0.005513\n",
"\n",
"[81 rows x 6 columns]"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DataFrame.sort_values(by=['Collision SP']).reset_index() ## Reset Index Based on Last Ordering"
]
},
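{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sorting on more than one column and resetting the index can be chained in a single line. The sketch below uses a made-up frame (`toy_df` and `sorted_df` are hypothetical names): rows are ordered by the first column in the list, ties are broken by the second, and `drop=True` throws away the old index instead of storing it in a new column.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Made-up values for illustration\n",
"toy_df = pd.DataFrame({'Detector': ['DET2', 'DET1', 'DET1'], 'Err': [0.3, 0.9, 0.1]})\n",
"\n",
"# Sort by Detector first, then by Err within each detector; drop=True discards the old index\n",
"sorted_df = toy_df.sort_values(by=['Detector', 'Err']).reset_index(drop=True)\n",
"print(sorted_df)\n",
"```\n",
"\n",
"The original `toy_df` is left untouched, because `sort_values()` and `reset_index()` both return new Data Frames."
]
},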
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example 3.1.1\n",
"You are trying to measure the envelope of a certain beam at TRIUMF, and you just completed a measurement using two different detectors (DET1 and DET2). The data from your measurements are saved in Example311.txt (tab edited). The file contains 4 columns: DetectorType, X, Y, Xerr, Yerr (all data in units of [mm]). Load the data into a Pandas data frame. Then do the following:\n",
"\n",
"* Print the first 5 elements, the last 5 elements and 5 random elements on the screen.\n",
"* Create a new Data Frame selecting only entries for DET1.\n",
"* Find the min, max, and mean in X and Y.\n",
"* Add the errors together, and create a new column with the result.\n",
"* Sort based on the column generated in the last bullet point, and change the index to match the new ordering."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"X min: 0.0 X max: 15.0 X mean: 7.0\n",
"Y min: 0.0 Y max: 30.0 Y mean: 13.88888888888889\n"
]
},
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
level_0
\n",
"
index
\n",
"
DetectorType
\n",
"
X
\n",
"
Y
\n",
"
Xerr
\n",
"
Yerr
\n",
"
Error Tot
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
11
\n",
"
19
\n",
"
DET1
\n",
"
4.0
\n",
"
15.0
\n",
"
0.13
\n",
"
0.43
\n",
"
0.56
\n",
"
\n",
"
\n",
"
1
\n",
"
14
\n",
"
26
\n",
"
DET1
\n",
"
2.0
\n",
"
17.0
\n",
"
0.57
\n",
"
0.11
\n",
"
0.68
\n",
"
\n",
"
\n",
"
2
\n",
"
19
\n",
"
35
\n",
"
DET1
\n",
"
6.0
\n",
"
0.0
\n",
"
0.28
\n",
"
0.41
\n",
"
0.69
\n",
"
\n",
"
\n",
"
3
\n",
"
21
\n",
"
39
\n",
"
DET1
\n",
"
4.0
\n",
"
21.0
\n",
"
0.01
\n",
"
0.78
\n",
"
0.79
\n",
"
\n",
"
\n",
"
4
\n",
"
13
\n",
"
25
\n",
"
DET1
\n",
"
15.0
\n",
"
11.0
\n",
"
0.10
\n",
"
0.91
\n",
"
1.01
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" level_0 index DetectorType X Y Xerr Yerr Error Tot\n",
"0 11 19 DET1 4.0 15.0 0.13 0.43 0.56\n",
"1 14 26 DET1 2.0 17.0 0.57 0.11 0.68\n",
"2 19 35 DET1 6.0 0.0 0.28 0.41 0.69\n",
"3 21 39 DET1 4.0 21.0 0.01 0.78 0.79\n",
"4 13 25 DET1 15.0 11.0 0.10 0.91 1.01"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"########################################################\n",
"## Example 3.1.1 ##\n",
"## Introduction to Scinetific Programming with Python ##\n",
"## ##\n",
"## Pietro Giampa, TRIUMF, 2020 ##\n",
"########################################################\n",
"\n",
"### Define Needed Libraries ###\n",
"import pandas as pd\n",
"\n",
"### Import Data from File ###\n",
"Filename = 'Example311.txt'\n",
"names = ['DetectorType','X','Y','Xerr','Yerr'] \n",
"DF = pd.read_csv(Filename,header=None,sep='\\t',names=names) ## Import Data\n",
"\n",
"### Print the first 5 elements, the last 5 elements and 5 random elements ###\n",
"DF.head() ## First 5 elements\n",
"DF.tail() ## Last 5 elements\n",
"DF.sample(5) ## 5 random elements\n",
"\n",
"### Create a new Data Frame selecting only entries for DET1 ###\n",
"DF_DET1 = DF[DF.DetectorType == 'DET1']\n",
"DF_DET1 = DF_DET1.reset_index() ## Reset the index due to slicing\n",
"\n",
"### Find the min, max, and mean in X and Y ###\n",
"X_min = DF_DET1['X'].min()\n",
"X_max = DF_DET1['X'].max()\n",
"X_mean = DF_DET1['X'].mean()\n",
"Y_min = DF_DET1['Y'].min()\n",
"Y_max = DF_DET1['Y'].max()\n",
"Y_mean = DF_DET1['Y'].mean()\n",
"print('X min:',X_min,'X max:',X_max,'X mean:',X_mean)\n",
"print('Y min:',Y_min,'Y max:',Y_max,'Y mean:',Y_mean)\n",
"\n",
"### Add the errors in quadrature, and create a new column with the resul ###\n",
"DF_DET1['Error Tot'] = DF_DET1['Xerr']+DF_DET1['Yerr']\n",
"\n",
"### Sort based on the column generated in the last bullet point, and change the index to match the new ordering ###\n",
"DF_DET1.sort_values(by=['Error Tot']).reset_index().head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Fitting With SciPy\n",
"Data fitting is one of the most useful toolsets for data analysis, it's the ultimate test between data and prediction. In Python, there are different options for data fitting. However, in this course, we will only review a method from the SciPy library called curve_fit. This is the most versatile fitting tool and adequate for all scientific levels.\n",
"\n",
"Cureve_fit requires a fitting function, a dataset input and a series of optional parameters. Here the most common and useful input parameters:\n",
"\n",
"* ydata - can be used to weight the fed dataset.\n",
"* p0 - Prior for floating parameters, this must be input as an array.\n",
"* bounds - Set limits on parameters floats.\n",
"* method - Allows you to change the minimization method.\n",
"\n",
"Moreover, the cureve_fit function returns two arrays:\n",
"\n",
"* popt - Optimal values for the parameters so that the sum of the squared residuals off(xdata, *popt) - ydata is minimized\n",
"* pcov - The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)). How the sigma parameter affects the estimated covariance depends on absolute_sigma argument, as described above.\n",
"\n",
"Remember that to run curve_fit you need to import the following library in your code:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"from scipy.optimize import curve_fit"
]
},
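{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before the Gaussian exercise, here is a minimal sketch of a complete curve_fit call on a simpler problem, a straight line. Everything in it is an assumption made for illustration: the model `line`, the synthetic data (true slope 2.0 and intercept 1.0), the noise level of 0.5 and the fixed random seed. It shows how `p0`, `sigma` and `absolute_sigma` are passed and how `perr` is obtained from `pcov`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy.optimize import curve_fit\n",
"\n",
"def line(x, slope, intercept):\n",
"    # Model function: the first argument is x, the remaining ones are the fit parameters\n",
"    return slope * x + intercept\n",
"\n",
"rng = np.random.default_rng(0)                  # fixed seed so the sketch is reproducible\n",
"xdata = np.linspace(0.0, 10.0, 50)\n",
"ydata = line(xdata, 2.0, 1.0) + rng.normal(0.0, 0.5, xdata.size)   # synthetic noisy data\n",
"sigma = np.full(xdata.size, 0.5)                # assumed uncertainty on every point\n",
"\n",
"popt, pcov = curve_fit(line, xdata, ydata, p0=[1.0, 0.0], sigma=sigma, absolute_sigma=True)\n",
"perr = np.sqrt(np.diag(pcov))                   # one-standard-deviation errors\n",
"print('slope:', popt[0], '+/-', perr[0])\n",
"print('intercept:', popt[1], '+/-', perr[1])\n",
"```\n",
"\n",
"The fitted values should land close to the true slope and intercept, within a few times `perr`. The same pattern applies to any model function, as long as x is its first argument."
]
},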
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example 3.2.1\n",
"Create a Gaussian function that receives as input parameters, $A$, $\\mu$, $\\sigma$. Your code should then generate 100 random numbers distributed following a Gaussian distribution with A=100.0, $\\mu$=55.0, $\\sigma$=32.5. Finally, fit the generated data with a gaussian fit. Print the extracted parameters on the screen next to the input parameters. Lastly. plot the data and the fit on a Canvas (include all labels).\n",
"\n",
"Recall: Gaussian distribution:\n",
"\n",
"$f(x) = A \\cdot e^{\\frac{-(x-\\mu)^{2}}{\\sigma}}$"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"A Prior: 100.0 | A Posterior: 100.0\n",
"mu Prior: 55.0 | mu Posterior: 55.0\n",
"sig Prior: 32.5 | sig Posterior: 32.5\n"
]
},
{
"data": {
"text/plain": [
"