Category: Coding

Git – Removing Confidential Info From History

The first cut of code may contain … not best practice code. Sometimes this is just hard coding something you’ll want to compute / look up in the future. Hard coding user input isn’t a problem if my first cut is always searching for ABC123. Hard coding the system creds? Not good. You sort that before you actually deploy the code. But some old iteration of the file has MyP2s5w0rD sitting right there in plain text. That’s bad in a system that maintains file history! The quick/easy way to clean up passwords stashed within the history is to download the BFG JAR file.

For this test, I created a new repository in .\source then created three clones of the repo (.\clone1, .\clone2, and .\clone3). In each cloneX copy, I created a tX folder that has a file named ldapAuthTest.py — a file that contains a statically assigned password as

strSystemAccountPass = "MyP2s5w0rD"

The first thing I did was to redact the password in the files — this means anyone looking at HEAD won’t see the password. Source, clone1, and clone2 are all current. The clone3 copy has pulled all changes but has a local change committed but not merged.

To clean the password from the git history, first create a backup of your repo (just in case!). Then mirror the repo to work on it

mkdir mirror
cd mirror
git clone --mirror d:\git\testFilterBranch\source

 

Create file .\replacements.txt with the string to be redacted — in this case:

strSystemAccountPass = "MyP2s5w0rD"

Formatting notes for replacements.txt

MyP2s5w0rD # Replaces string with default ***REMOVED***
MyP2s4w0rD==>REDACTED # Replaces string using custom string REDACTED
MyP2s3w0rD==> # Replaces string with null -- i.e. removes the string
regex:strSystemAccountPass\s?=\s?"MyP2s2w0rD""==>REDACTED # Uses a regex match -- in this case we may or may not have a space around the equal sign

So, in my mirror folder, I have the replacement.txt file which defines which strings are replaced. I have a folder that contains the mirror of my repo.

lisa@FLEX3 /cygdrive/d/git/testFilterBranch/mirror
$ ls
replacements.txt source.git

To replace my “stuff”, run bfg using the –replace-text option. Because I only want to replace the text in files named ldapAuthTest.py, I also added the -fi option

java -jar ../bfg-1.14.0.jar --replace-text ..\replacements.txt -fi ldapAuthTest.py source.git

 

lisa@FLEX3 /cygdrive/d/git/testFilterBranch/mirror
$ java -jar ../bfg-1.14.0.jar --replace-text replacements.txt -fi ldapAuthTest.py source.git

Using repo : D:\git\testFilterBranch\mirror\source.git

Found 3 objects to protect
Found 2 commit-pointing refs : HEAD, refs/heads/master

Protected commits
-----------------
These are your protected commits, and so their contents will NOT be altered:
* commit 87f1b398 (protected by 'HEAD')

Cleaning
--------
Found 5 commits
Cleaning commits: 100% (5/5)
Cleaning commits completed in 613 ms.

Updating 1 Ref
--------------

Ref Before After
---------------------------------------
refs/heads/master | 87f1b398 | 919c8f0f

Updating references: 100% (1/1)
...Ref update completed in 151 ms.

Commit Tree-Dirt History
------------------------

Earliest Latest
| |
. D D D m

D = dirty commits (file tree fixed)
m = modified commits (commit message or parents changed)
. = clean commits (no changes to file tree)

Before After
-------------------------------------------
First modified commit | dc2cd935 | 8764f6f1
Last dirty commit | 9665c4e0 | ccdf0359

Changed files
-------------

Filename Before & After
-------------------------------------
ldapAuthTest.py | 25e79fa6 ? 4d12fdad

In total, 8 object ids were changed. Full details are logged here:
D:\git\testFilterBranch\mirror\source.git.bfg-report\2021-06-23\12-50-00

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

Check to make sure nothing looks abjectly wrong. Assuming the repo is sound, we’re ready to clean up and push these changes.

cd source.git

git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push

 

Pulling the update from my source repo, I have merge conflicts

These are readily resolved and the source repo can be merged into my local copy.

And the change I had committed but not pushed is still there.

Pushing that change produces no errors

Now … pushing the bfg changes may not work. In my case, the real repo has a bunch of branchs and I am getting “non fast-forward merges”. To get the history changed, I need to do a force push. Not so good for the other developers! In that case, everyone should get their changes committed and pushed. The servers should be checked to ensure they are up to date. Then the force push can be done and everyone can pull the new “good” data (which, really, shouldn’t differ from the old data … it’s just the history that is being tweaked).

On UUIDs

RFC 4122 UUID Versions:

1 — Datetime and MAC based
48-bit MAC address, 60-bit timestamp, 13-14 bit uniquifying sequence

2 — Datetime and MAC based with DCE security
8 least significant clock sequence numbers and least significant 32 bits of timestamp. RFC doesn’t reallly provide details on DCE security

3 — Hashed Namespace
MD5 hash of namespace

4 — Random
6 pre-determined bits (4 bits for version, 2-3 bits for variant 1 or 2) and 122 bits for 2^122 possible v4 variant 1 UUIDs

5 — Hashed Namespace
SHA-1 hash of namespace

In my case, I hesitate to use a v1 or v2 UUID because I have scripts executing in cron on the same host. The probability of the function being called at the same microsecond time seems higher than the pseudo-random number generator popping the same value in the handful of hours for which the UUIDs will be persisted for deduplication.

v3 or v5 UUIDs are my fallback position if we’re seeing dups in v4 — the namespace would need to glom together the script name and microsecond time to make a unique string when multiple scripts are running the function concurrently.

Kafka Troubleshooting (for those who enjoy reading network traces)

I finally had a revelation that allowed me to definitively prove that I am not doing anything strange that is causing duplicated messages to appear in the Kafka stream — it’s a clear text protocol! That means you can use Wireshark, tcpdump, etc to capture everything that goes over the wire. This shows that the GUID I generated for the duplicated message only appears one time in the network trace. Whatever funky stuff is going on that makes the client see it twice? Not me 😊

I used tcpdump because the batch server doesn’t have tshark (and it’s not my server, so I’m not going to go requesting additional binaries if there’s something sufficient for my need already available). Ran tcpdump -w /srv/data/ljr.cap port 9092 to grab everything that transits port 9092 while my script executed. Once the batch completed, I stopped tcpdump and transferred the file over to my workstation to view the capture in Wireshark. Searched the packet bytes for my duplicated GUID … and there’s only one.

Confluent Kafka Queue Length

The documentation for the Python Confluent Kafka module includes a len function on the producer. I wanted to use the function because we’re getting a number of duplicated messages on the client, and I was trying to isolate what might be causing the problem. Unfortunately, calling producer.len() failed indicating there’s no len() method. I used dir(producer) to show that, no, there isn’t a len() method.

I realized today that the documentation is telling me that I can call the built-in len() function on a producer to get the queue length.

Code:

print(f"Before produce there are {len(producer)} messages awaiting delivery")
producer.produce(topic, key=bytes(str(int(cs.timestamp) ), 'utf8'), value=cs.SerializeToString() )
print(f"After produce there are {len(producer)} messages awaiting delivery")
producer.poll(0) # Per https://github.com/confluentinc/confluent-kafka-python/issues/16 for queue full error
print(f"After poll0 there are {len(producer)} messages awaiting delivery")

Output:

Before produce there are 160 messages awaiting delivery
After produce there are 161 messages awaiting delivery
After poll0 there are 155 messages awaiting delivery

Boolean Opts in Python

I have a few command line arguments on a Python script that are most readily used if they are boolean. I sometimes need a “verbose” option for script debugging — print a lot of extra stuff to show what’s going on, and I usually want a “dry run” option where the script reads data, performs calculations, and prints results to the screen without making any changes or sending data anywhere (database, email, etc). To use command line arguments as boolean values, I use a function that converts a variety of possible inputs to True/False.

def string2boolean(strInput):
    """
    :param strInput: String string to be converted to boolean
    :return: Boolean representation of input
    """
    if isinstance(strInput, bool):
        return strInput
    if strInput.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    elif strInput.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    else:
        raise argparse.ArgumentTypeError('Boolean value expected.')

Use “type” when adding the argument to run the input through your function.

    parser.add_argument('-r', '--dryrun', action='store', type=string2boolean, dest='boolDryRun', default=False, help="Preview data processing without sending data to DB or Kafka. Valid values: 'true' or 'false'.")

Oracle – Collections and IN or LIKE Queries

I’ve been retrofitting a lot of PHP/SQL queries to use oci_bind_by_name recently. When using “IN” clauses, you can iterate through your array twice, but it’s an inefficient approach.

// Build the query string with a bunch of placeholders
$strQuery = "select Col1, Col2, Col3 from TableName where ColName IN (";
for($i=0; $i < count($array); $i++){
    if($i > 0){
        $strQuery = $strQuery . ", ";
    }
    $strQuery = $strQuery . ":bindvar" . $i;
}
$strQuery = $strQuery . ")";
... 
// Then bind each placeholder to something
for($i=0; $i < count($array); $i++){
    oci_bind_by_name($stmt, ":bindvar".$i, $array[$i]);
}

Building a table from the array data and using an Oracle collection object creates cleaner code and avoids a second iteration of the array:

$strQuery = "SELECT indexID, objName FROM table WHERE objName in (SELECT column_value FROM table(:myIds))";
$stmt = oci_parse($conn, $strQuery);

$coll = oci_new_collection($kpiprd_conn, 'ODCIVARCHAR2LIST','SYS');
foreach ($arrayValues as $strValue) {
     $coll->append($strValue);
}
oci_bind_by_name($stmt, ':myIds', $coll, -1, OCI_B_NTY);
oci_set_prefetch($stmt, 300);
oci_execute($stmt);

A simple like clause is quite straight-forward

$strNameLikeString = "SomeName%";
$strQuery = "SELECT ds_dvrsty_set_nm from ds_dvrsty_set WHERE ds_dvrsty_set_nm LIKE :divsetnm ORDER BY ds_dvrsty_set_nm DESC fetch first 1 row only";

$stmt = oci_parse($connDB, $strQuery);
oci_bind_by_name($stmt, ":divsetnm", $strNameLikeString);
oci_set_prefetch($stmt, 300);
oci_execute($stmt);

But what about an array of inputs essentially reproducing the LIKE ANY predicate in PostgreSQL? There’s not a direct equivalent in Oracle, and iterating through the array twice to build out a query WHERE (Field1 LIKE ‘Thing1%’ OR Field1 LIKE ‘Thing2%’ OR Field1 LIKE ‘Thing3%’) is undesirable. The with EXISTS allows me to create a LIKE ANY type query and only iterate through my array once to bind variables to placeholders using the same collection approach as was used with the IN clause.

$arrayLocs = array('ERIEPAXE%', 'HNCKOHXA%', 'LTRKARXK%');
$strQuery = "SELECT location_id, clli_code FROM network_location WHERE EXISTS (select 1 FROM TABLE(:likelocs) WHERE clli_code LIKE column_value)";
$stmt = oci_parse($connDB, $strQuery);

$coll = oci_new_collection($connDB, 'ODCIVARCHAR2LIST','SYS');
foreach ($arrayLocs as $strLocation) {
    $coll->append($strLocation);
}
oci_bind_by_name($stmt, ':likelocs', $coll, -1, OCI_B_NTY);
oci_execute($stmt);
print "<table>\n";
print "<tr><th>Loc ID</th><th>CLLI</th></tr>\n";
while ($row = oci_fetch_array($stmt, OCI_ASSOC+OCI_RETURN_NULLS)) {
    print "<tr><td>" . $row['LOCATION_ID'] . "</td><td>" . $row['CLLI_CODE'] . "</td></tr>\n";
}
print "</table>\n";

There are many different collection types in Oracle which can be used with oci_new_collection. A full list of the system collection types can be queried from the database.

SELECT * FROM SYS.ALL_TYPES WHERE TYPECODE = 'COLLECTION' and OWNER = 'SYS';

JQuery – Finding a set of checkboxes

A corollary to my JavaScript modifying checkbox values when the box is checked or unchecked … I needed a way to reset the form (in my form, the default is for the boxes to be checked and the value to be 1). The following code identifies all checkboxes with a particular class, checks them, and sets the value to 1.

/**
 * This function checks off each checkbox of the input class
 *
 * @param {string} strCheckboxClass     Name of class identifying in-scope checkboxes
 * @return {null} 
 *
 * @example
 *
 *     checkAllDatabases ('MyBoxes');
 */
 function checkAllDatabases(strCheckboxClass){
    arrayCheckboxes = $('.'+strCheckboxClass);
    for(i = 0; i < arrayCheckboxes.length; i++) {
        $( '#'+arrayCheckboxes[i].name).prop( "checked", true );
        $( '#'+arrayCheckboxes[i].name).val(1);
    } 
}

Changing checkbox value when (un)checked

This bit of code handles another rather esoteric scenario — I have a generic “go to this URL and download the resultant Excel file” JavaScript function. This is because I write a lot of reporting tools and didn’t want to write a lot of code for each new tool. The template is an input form with a submit button that calls the generic function. Params for the elements on the form from which values are read, the URL to call to generate the report, and the POST elements into which each corresponding form value is inserted gets stuffed. Works great for text inputs. Works fine for drop-downs. But the value of a checkbox is really a combination of the potential value (from the value tag) and the checked state. That is — my Button 1 has a potential value of 1, but if the box is checked or not is really important.

Instead of attempting to determine the type of element in each form input so I can evaluate the checked condition, I decided to just change the value when the checkbox state is changed. Now Button 1 has a potential value of 0 when unchecked and a potential value of 1 when checked. I don’t need to know if the box is checked because the value answers that question. So passing along button1’s value to my URL lets the target site know if I want whatever Button 1 represents. (In this case, users are able to select from a list of seven data sources — smaller numbers of data sources reduce the query time but also fail to provide the most robust report).

The JavaScript to handle changing the checkbox value when the checked state changes:

$("#button1").change(function () {
    if ($("#button1").is(':checked')) {
        $("#button1").val(1);
    }
    else{
        $("#button1").val(0);
    }
});

$("#button2").change(function () {
    if ($("#button2").is(':checked')) {
        $("#button2").val(1);
    }
    else{
        $("#button2").val(0);
    }
});

The HTML defining these two checkboxes:

<input type="checkbox" id="button1" name="button1" value="1" checked><label for="ngmss">Thing 1</label>
<input type="checkbox" id="button2" name="button2" value="1" checked><label for="ngmss">Thing 2</label>

Google OAUTH Stuff

Reminder to self — when you set up a desktop app with OAUTH to use the Google APIs … you have to hit the authorization URL from the computer running the code. That means, for my calendar scraper, that I need to do X-redirection from the server & run the script. Firefox launches & the flow actually completes. Attempting to hit the URL from my computer yields a connection failure to the https://localhost:SomePort at the end of the workflow.

Move token.pickle to backup file, run getCalendarEvents.py with X-redirection so auth can be processed through web form.

Python: dir

I am writing this down because I never manage to remember these two super useful functions that tells you what a variable is.

iLastProcessedTimestamp = 0
with open(‘test.txt’) as f:
iLastProcessedTimestamp = int(f.readline())
print(dir(iLastProcessedTimestamp))
print(type(iLastProcessedTimestamp))

The type function tells you the variable’s class (in this case, int). The dir function tells you the attributes of the variable.