Title
David's blog -
Update Time
2022-05-03 00:31:57


The most under-rated programming books

Ask any programmer what their favourite programming book is, and their answer will be one of the usual suspects: Code Complete, The Pragmatic Programmer, or Design Patterns. And rightly so; these are outstanding and highly regarded works that belong on every programmer’s bookshelf. (If you’re just starting to build up your bookshelf, Jeff Atwood has some great recommendations.)

But once you get past the “essential” books, you’ll find that there are many incredibly good programming books out there that people don’t talk much about, but which were essential in taking me to the next levels in my professional growth.

Here’s a partial list of such books. I’m sure there are many others; feel free to mention them in the comments.

Growing Object-Oriented Software, Guided by Tests

Posted on June 16, 2021 (updated July 7, 2021) by David Lindelöf

Feature standardization considered harmful

Many statistical learning algorithms perform better when the covariates are on similar scales.
For example, it is common practice to standardize the features used by an artificial neural network so that the gradient of its objective function doesn’t depend on the physical units in which the features are described.

The same advice is frequently given for K-means clustering (see “Do Clustering algorithms need feature scaling in the pre-processing stage?”, “Are mean normalization and feature scaling needed for k-means clustering?”, and “In cluster analysis should I scale (standardize) my data if variables are in the same units?”), but there’s a great counter-example given in The Elements of Statistical Learning that I try to reproduce here.

Consider two point clouds ($n=50$ each), randomly drawn around two centres, each 3 units away from the origin:

Posted on June 11, 2021 (updated July 7, 2021) by David Lindelöf, 1 comment

No, you have not controlled for confounders

When observational data includes a treatment indicator and some possible confounders, it is very tempting to simply regress the outcome on all features (confounders and treatment alike), extract the coefficient associated with the treatment indicator, and proudly proclaim that “we have controlled for confounders and estimated the treatment effect”.

This approach is wrong. Very wrong. At least as wrong as that DIY electrical job you did last week: it looks all good and neat but you’ve made a critical mistake and there’s no way you can find out without killing yourself.

(Image caption: Or worse, by thinking you’ve controlled for confounders when you haven’t.)

I can’t explain why this is wrong (I’m not sure I understand it myself), but I can show you some examples proving that it is. We’ll work through a few examples, where we compare the results of a traditional regression with those of a couple of legit causal inference libraries. Since we use simulated data, we’ll also be able to compare with the “true” treatment effect.
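To make the claim concrete, here is a minimal Python sketch of one well-known way the "regress on everything" approach can fail. It is not one of the post's own examples. It assumes that the supposed confounder is in fact a collider (a variable caused by both the treatment and the outcome), and the data-generating process, the helper names `ols` and `simulate`, and all parameter values are hypothetical choices made for illustration only.

```python
import random


def ols(columns, y):
    """Least-squares coefficients for y ~ 1 + columns, via the normal equations."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(row[r] * row[c] for row in X) for c in range(k)]
        + [sum(row[r] * yi for row, yi in zip(X, y))]
        for r in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[r][k] for r in range(k)]  # [intercept, coef_1, coef_2, ...]


def simulate(n=20000, true_effect=1.0, seed=0):
    """Return the treatment coefficient without and with the collider as a covariate."""
    rng = random.Random(seed)
    # Randomly assigned binary treatment (assumed; no real confounding here)
    t = [float(rng.random() < 0.5) for _ in range(n)]
    # Outcome depends on the treatment only
    y = [true_effect * ti + rng.gauss(0, 1) for ti in t]
    # Collider: caused by both treatment and outcome
    c = [ti + yi + rng.gauss(0, 1) for ti, yi in zip(t, y)]
    unadjusted = ols([t], y)[1]
    adjusted = ols([t, c], y)[1]
    return unadjusted, adjusted


if __name__ == "__main__":
    unadjusted, adjusted = simulate()
    print(f"treatment coefficient, y ~ t:     {unadjusted:.3f}")
    print(f"treatment coefficient, y ~ t + c: {adjusted:.3f}")
```

Under these assumptions the true effect is 1. The regression on the treatment alone recovers roughly that value, while adding the collider as a covariate pulls the treatment coefficient towards 0, even though the regression output looks perfectly healthy.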
Posted on February 10, 2021 (updated February 22, 2021) by David Lindelöf, 4 comments

A/B testing my resume

Internet wisdom is divided on whether one-page resumes are more effective at landing you an interview than two-page ones. Most of the advice out there seems to be opinion- or anecdote-based, with very little scientific basis.

Well, let’s fix that.

Being currently open to work, I thought this would be the right time to test this scientifically. I have two versions of my resume:

- A two-page resume: employment and education on the first page, extra information such as online courses, hobbies, etc. on the second page.
- A one-page, dense resume: responsibilities and achievements only, following the template from the Career Tools resume workbook.

The purpose of a resume is to land you an interview, so we’ll track for each resume how many applications yield a call for an interview. Non-responses after one week are treated as failures. We’ll model the number of interviews landed by each resume with a binomial distribution: all other things being equal, we’ll assume all applications using the same resume type have the same probability ($p_1$ or $p_2$) of landing an interview. We’d like to estimate these probabilities, and decide if one resume is more effective than the other.

Posted on November 24, 2020 (updated February 22, 2021) by David Lindelöf, 4 comments

Unit testing SQL with PySpark

Machine-learning applications frequently feature SQL queries, which range from simple projections to complex aggregations over several join operations.

There doesn’t seem to be much guidance on how to verify that these queries are correct.
All mainstream programming languages have embraced unit tests as the primary tool to verify the correctness of the language’s smallest building blocks. All, that is, except SQL.

And yet, SQL is a programming language and SQL queries are computer programs, which should be tested just like every other unit of the application.

(Image caption: I’m not responsible)

All mainstream languages have libraries for writing unit tests: small computer programs that verify that each software module works as expected. But SQL poses a special challenge, as it can be difficult to use SQL to set up a test, execute it, and verify the output. SQL is a declarative language, usually embedded in a “host” programming language: a language within a language.

So to unit test SQL we need to use that host language to set up the data tables used in our queries, orchestrate the execution of the SQL queries, and verify the correctness of the results.

Posted on November 16, 2020 (updated February 22, 2021) by David Lindelöf, 3 comments

Scraping real estate for fun

Here’s a fun weekend project: scrape the real estate classifieds of the website of your choice, and do some analytics on the data. I did just that last weekend, using the Scrapy Python library for web scraping, which I then let loose on one of the major real estate classifieds websites in Switzerland (can’t tell you which one; not sure they would love me for it).

After about 10 minutes I had the data for 12’124 apartments or houses for sale across Switzerland, with room count, area, price, city, and canton.

I’ve imported the data into R, and log-transformed the room count, area, and price because of extreme skewness.
Here’s the resulting scatterplot matrix, obtained with ggpairs():

There are a number of interesting features, even in this raw, unclean dataset:

- there are about twice as many apartments for sale as houses
- the room count comes in discrete values in steps of 0.5 (half rooms are frequently used for “smaller” rooms such as a small kitchen, a small hallway, etc.)
- the room count is highly correlated with area, as expected
- the price is more correlated with the area than with the room count
- there are several extreme outliers:
  - a property with 290 rooms (a typo; the owner meant an area of 290 m²)
  - some properties with abnormally low area (one of them was a house with a listed room count of 1 and an area of 1 m²; the owner obviously didn’t bother to enter correct data)
  - and, more interestingly, several properties with abnormally low prices; the lowest-priced item is a 3.5-room, 80 m² apartment in Fribourg priced at CHF 99.-.

Before we go any further, we’ll obviously have to clean up these faulty data points. There don’t seem to be many of them, so I’ll do that manually, and write a follow-up post if I find anything interesting.

Posted on November 6, 2020 by David Lindelöf

Testing Scientific Software with Hypothesis

Writing unit tests for scientific software is challenging because frequently you don’t even know what the output should be. Unlike business software, which automates well-understood processes, here you cannot simply work your way through use case after use case, unit test after unit test. Your program is either correct or it isn’t, and you have no way of knowing.

If you cannot know whether your program is correct, does it mean you cannot test it? No, but you’ll be using techniques other than the ones traditionally used for unit tests. In this piece we’ll look at two techniques you can use to test your software.
These are:

- comparing the output with a reference (an oracle)
- verifying that the output satisfies certain properties.

And we’ll illustrate these two techniques through three examples:

- the square root function
- the trigonometric functions
- a simple Pandas operation

Posted on October 28, 2020 (updated July 7, 2021) by David Lindelöf

Monty Hall: a programmer’s explanation

I take it we’re all familiar with the infamous Monty Hall problem:

“Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say A, and the host, who knows what’s behind the doors, opens another door, say C, which has a goat. He then says to you, ‘Do you want to pick door B?’ Is it to your advantage to switch your choice?”

The correct strategy is to accept the offer to switch doors. The expected gain when doing so is 2/3, vs 1/3 when choosing not to change doors (there are only two doors left closed, so one of them must have the car; hence the expected gains sum to 1).

https://xkcd.com/1282/

The mathematical proof is straightforward, but most people (including me) have a hard time understanding intuitively what’s going on. There are plenty of explanations available online, but I’ve recently simulated the problem and came across one that I think is simpler and more elegant than anything I’ve seen so far. And I didn’t make it up; the simulation code did. Here’s how.

Classic Monty

Let’s simulate 10,000 rounds of the game. The car vector will hold the door which has the car behind it:
n
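As a rough illustration of the 2/3 vs 1/3 claim, here is a minimal Python sketch of such a simulation. It is not the post's own code: it uses an explicit loop, and the function name `simulate_monty_hall`, the door labels, and the seed are assumptions made for the example.

```python
import random


def simulate_monty_hall(rounds=10_000, seed=42):
    """Return the fraction of rounds won by staying and by switching."""
    rng = random.Random(seed)
    doors = ["A", "B", "C"]
    stay_wins = 0
    switch_wins = 0
    for _ in range(rounds):
        car = rng.choice(doors)   # door hiding the car
        pick = rng.choice(doors)  # contestant's initial pick
        # The host opens a door that is neither the pick nor the car
        opened = rng.choice([d for d in doors if d != pick and d != car])
        # Switching means taking the one remaining closed door
        switched = next(d for d in doors if d != pick and d != opened)
        stay_wins += pick == car
        switch_wins += switched == car
    return stay_wins / rounds, switch_wins / rounds


if __name__ == "__main__":
    stay, switch = simulate_monty_hall()
    print(f"win rate when staying:   {stay:.3f}")
    print(f"win rate when switching: {switch:.3f}")
```

In every round exactly one of the two strategies wins, so the two win rates always sum to 1. Over many rounds they should land close to 1/3 and 2/3 respectively.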